ADR-0023: SPARQL Federation and Instance Mesh
- Status: Accepted (2026-04-19)
- Date: 2026-04-17
Context
Trails instances currently run standalone: one local Oxigraph store, one set of capabilities, one Cedar policy context. This is correct for single-app development but insufficient for the cross-org, cross-instance scenarios that knowledge-graph apps inevitably encounter:
- Healthcare: a hospital's evidence graph queries a research institution's publication graph for drug-interaction data.
- Compliance: an auditor's instance pulls regulatory change events from a regulator's instance without bulk-importing the dataset.
- Research: a literature review tool federates across three domain-specific KG instances, each maintained by a different team.
All of these require querying or invoking capabilities across instance boundaries. Today, users must manually export, copy, and re-import data — defeating the purpose of a live knowledge graph.
Two established mechanisms exist for this:
- SPARQL 1.1 Federated Query: the W3C SERVICE keyword routes sub-queries to remote SPARQL endpoints. Well-specified, widely implemented, zero application-layer code required for basic cases.
- MCP capability invocation: Trails already uses MCP as its primary transport (ADR-0008). Capability discovery and invocation across instances is a natural extension of MCP's design.
Neither mechanism alone is sufficient. SPARQL federation handles declarative graph queries but cannot invoke procedural capabilities. MCP capability relay handles procedure calls but is not the right tool for ad-hoc graph traversal. A practical multi-instance story requires both, plus a discovery layer to find peers.
Decision
Trails supports cross-instance operation through three complementary layers, adopted incrementally via progressive enhancement (ADR-0021).
Layer 1: SPARQL Federation (trails.federation)
Each Trails instance can expose a read-only SPARQL endpoint at
/sparql (disabled by default, enabled in trails.toml). Remote
instances are queried using the standard SERVICE keyword in SPARQL 1.1
Federated Query:
PREFIX : <https://example.org/vocab#>  # placeholder namespace for illustration
SELECT ?drug ?interaction WHERE {
  ?drug :name "Aspirin" .
  SERVICE <https://pharma.example/sparql> {
    ?drug :interactsWith ?interaction .
  }
}
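As an illustration of how a client might submit the query above, the following sketch builds (but does not send) a SPARQL 1.1 Protocol request using only the Python standard library. The local endpoint URL `https://local.example/sparql`, the `build_sparql_request` helper, and the namespace bound to the `:` prefix are assumptions made for this example; the ADR does not specify them.

```python
import urllib.request

# Hypothetical namespace: the ADR does not say what the ":" prefix binds to.
FEDERATED_QUERY = """\
PREFIX : <https://example.org/vocab#>
SELECT ?drug ?interaction WHERE {
  ?drug :name "Aspirin" .
  SERVICE <https://pharma.example/sparql> {
    ?drug :interactsWith ?interaction .
  }
}
"""


def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol "query via POST directly" request."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


# Assumed local endpoint URL. Sending is left to the caller,
# for example via urllib.request.urlopen(request, timeout=...).
request = build_sparql_request("https://local.example/sparql", FEDERATED_QUERY)
```

Authentication (the DID and biscuit token described in the constraints) is deliberately left out of this sketch.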
Key constraints:
- Read-only. Federation endpoints do not accept INSERT, DELETE, or UPDATE operations. Write paths require explicit capability invocation (Layer 2).
- Cedar-gated. Every federated query is evaluated against the instance's Cedar policies before execution. The requesting principal's identity (DID, biscuit token) is forwarded; the remote instance evaluates its own policies. No data leaks to unauthorized peers. Cedar policy evaluation applies per ADR-0006.
- Cost-enveloped. Federated sub-queries carry a cost envelope (ADR-0012). The local CostAccountant tracks remote query cost as part of the parent envelope. Remote instances report actual cost in response headers. Cost envelopes span federation boundaries per ADR-0012a.
- Timeout-bounded. Remote SERVICE calls inherit the local query's wall-clock timeout (ADR-0007 update). A slow or unresponsive peer times out; the local query fails gracefully with a diagnostic, not a hang.
- Query rewriting. The trails.federation module rewrites outgoing queries to strip graph patterns that the remote peer's published capabilities do not cover, avoiding wasted round-trips.
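The cost-envelope and timeout constraints can be pictured with a small sketch. It assumes a hypothetical `CostEnvelope` class and a hypothetical `X-Trails-Cost` response header; the ADR names a `CostAccountant` and says remote cost is reported in response headers, but it does not define either API, so treat every name below as illustrative.

```python
import time


class BudgetExceeded(Exception):
    """Raised when a charge would push spending past the envelope's budget."""


class CostEnvelope:
    """Hypothetical parent envelope shared by local work and remote sub-queries."""

    def __init__(self, budget: float, timeout_s: float, clock=time.monotonic):
        self.budget = budget
        self.spent = 0.0
        self._clock = clock
        self._deadline = clock() + timeout_s  # one wall-clock deadline for the whole query

    def charge(self, amount: float) -> None:
        if self.spent + amount > self.budget:
            raise BudgetExceeded(f"{self.spent + amount} exceeds budget {self.budget}")
        self.spent += amount

    def remaining_time(self) -> float:
        """Seconds a remote SERVICE call may still use before the parent times out."""
        return max(0.0, self._deadline - self._clock())

    def charge_remote(self, response_headers: dict) -> None:
        # "X-Trails-Cost" is an assumed header name, not one defined by the ADR.
        self.charge(float(response_headers.get("X-Trails-Cost", 0.0)))


envelope = CostEnvelope(budget=10.0, timeout_s=5.0)
envelope.charge(2.5)                              # local evaluation
envelope.charge_remote({"X-Trails-Cost": "4.0"})  # cost reported by the peer
```

Passing `envelope.remaining_time()` as the HTTP timeout of each remote call is one way to make the sub-query inherit the parent's deadline rather than start a fresh one.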
Layer 2: MCP Capability Relay
When a local capability needs to invoke a remote capability (not just query data), it uses MCP's existing invocation protocol:
@capability
async def cross_check(ctx, drug_id: str) -> dict:
    # Invoke a capability on a remote Trails instance
    result = await ctx.remote("pharma-instance").invoke(
        "check_interactions", drug_id=drug_id
    )
    return result
- Discovery uses trails registry with peer URLs configured in trails.toml or discovered via Layer 3. Remote capability manifests are fetched from /.well-known/capabilities (ADR-0005) and cached locally with TTL-based invalidation.
- Authentication uses the same biscuit + DID mechanism as local invocations (ADR-0010, ADR-0011). The local instance presents its DID and a scoped biscuit to the remote instance.
- Cost attribution flows through: the remote invocation's cost is added to the local capability's cost envelope.
- Provenance records the cross-instance call as a prov:Delegation in the local provenance graph (ADR-0009), linking local and remote activity IRIs.
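The TTL-based manifest caching mentioned under discovery might look roughly like the sketch below. `ManifestCache`, the injected `fetch` callable, and the manifest shape `{"capabilities": [...]}` are assumptions for illustration; the ADR specifies only the `/.well-known/capabilities` path and that invalidation is TTL-based.

```python
import time
from typing import Callable, Dict, Tuple


class ManifestCache:
    """Caches remote capability manifests; an entry expires after ttl_s seconds."""

    def __init__(self, fetch: Callable[[str], dict], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch    # e.g. an HTTP GET of <peer>/.well-known/capabilities
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, peer_url: str) -> dict:
        now = self._clock()
        cached = self._entries.get(peer_url)
        if cached is not None and now - cached[0] < self._ttl_s:
            return cached[1]                  # still fresh
        manifest = self._fetch(peer_url)      # missing or expired: fetch again
        self._entries[peer_url] = (now, manifest)
        return manifest


# Demonstration with a fake fetcher and a fake clock (no network access).
calls = []
now = [0.0]


def fake_fetch(peer_url: str) -> dict:
    calls.append(peer_url)
    return {"capabilities": ["check_interactions"]}  # assumed manifest shape


cache = ManifestCache(fake_fetch, ttl_s=60.0, clock=lambda: now[0])
cache.get("https://pharma.example")  # fetches
cache.get("https://pharma.example")  # served from cache
now[0] = 61.0
cache.get("https://pharma.example")  # TTL elapsed, fetches again
```

Injecting the fetcher and the clock keeps the cache testable without a live peer.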
Layer 3: Instance Mesh (Peer Discovery and Routing)
Peer discovery is the substrate that Layers 1 and 2 build on:
- Static config (Level 2): peers listed in trails.toml under [federation.peers]. Simplest, works everywhere.
[federation.peers.pharma]
url = "https://pharma.example"
trust = "verified" # or "provisional"
[federation.peers.regulatory]
url = "https://reg.example"
trust = "verified"
- DNS-SD / mDNS (Level 4): for local-network and dev scenarios, instances announce themselves via DNS Service Discovery. trails dev --mesh enables mDNS announcement and peer scanning on the local subnet.
- Health monitoring. Peers are health-checked at configurable intervals. Unhealthy peers are removed from the active peer set and re-added when they recover. Health status is exposed in trails status.
- Consistent hashing for query routing. When multiple peers hold overlapping data (e.g., partitioned by named graph), the mesh layer uses consistent hashing on the graph IRI to route SERVICE calls to the peer most likely to hold the relevant partition. This is a routing optimization, not a consistency guarantee.
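To make the routing idea concrete, here is a minimal consistent-hash ring keyed on graph IRIs. The `PeerRing` class, the use of SHA-256, and the number of virtual nodes are assumptions of this sketch, not details taken from the ADR; removing a peer stands in for a failed health check.

```python
import bisect
import hashlib


class PeerRing:
    """Consistent-hash ring mapping graph IRIs to peer names."""

    def __init__(self, replicas: int = 64):
        self._replicas = replicas  # virtual nodes per peer, smooths the distribution
        self._points = []          # sorted hash positions on the ring
        self._owner = {}           # hash position -> peer name

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, peer: str) -> None:
        for i in range(self._replicas):
            point = self._hash(f"{peer}#{i}")
            self._owner[point] = peer
            bisect.insort(self._points, point)

    def remove(self, peer: str) -> None:
        """Drop a peer, e.g. after it fails a health check."""
        self._points = [p for p in self._points if self._owner[p] != peer]
        self._owner = {p: o for p, o in self._owner.items() if o != peer}

    def route(self, graph_iri: str) -> str:
        if not self._points:
            raise LookupError("no healthy peers")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(graph_iri)) % len(self._points)
        return self._owner[self._points[idx]]


ring = PeerRing()
for name in ("pharma", "regulatory"):
    ring.add(name)
target = ring.route("https://pharma.example/graphs/interactions")
```

Because only the removed peer's positions leave the ring, graph IRIs owned by the remaining peers keep their routes when a peer drops out, which is the property that makes this a cheap routing optimization.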
Progressive Enhancement Levels
Per ADR-0021, federation is additive. Existing standalone behavior is never broken.
| Level | Capability | Config required |
|---|---|---|
| 0 | Standalone (today's default) | None |
| 1 | Expose read-only SPARQL endpoint | [federation] endpoint = true |
| 2 | SERVICE queries to known peers | [federation.peers] in trails.toml |
| 3 | MCP capability relay to remote instances | [federation.peers] + remote capability discovery |
| 4 | Mesh discovery + automatic peer management | [federation] mesh = true + mDNS/DNS-SD |
Each level is a strict superset of the previous. An instance at Level 0 is unaffected by federation features existing in the codebase.
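One way to read the table is as a ladder: an instance sits at the highest level whose requirement, and every requirement below it, is met. The sketch below encodes that reading. The `endpoint`, `peers`, and `mesh` keys come from the table; the `relay` key for Level 3 is purely hypothetical, since the table does not name a setting for remote capability discovery.

```python
def federation_level(config: dict) -> int:
    """Return the progressive-enhancement level implied by a parsed trails.toml."""
    fed = config.get("federation", {})
    requirements = [
        bool(fed.get("endpoint")),  # Level 1: expose the read-only SPARQL endpoint
        bool(fed.get("peers")),     # Level 2: at least one known peer
        bool(fed.get("relay")),     # Level 3: hypothetical key for capability relay
        bool(fed.get("mesh")),      # Level 4: mesh discovery
    ]
    level = 0
    for satisfied in requirements:
        if not satisfied:
            break  # levels are cumulative, so stop at the first unmet requirement
        level += 1
    return level


standalone = federation_level({})
with_peers = federation_level({
    "federation": {
        "endpoint": True,
        "peers": {"pharma": {"url": "https://pharma.example", "trust": "verified"}},
    }
})
```

Under this reading, an empty configuration yields Level 0, which matches the statement that standalone instances are unaffected.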
Consequences
Positive
- Standard-based. SPARQL 1.1 Federation is a W3C standard; no proprietary query protocol. Any SPARQL-compliant endpoint (not just Trails) can participate as a federation peer.
- Policy-respecting by default. Cedar gates every cross-instance data flow. No "federation bypasses auth" footgun.
- Cost-aware. Federation does not create an unobservable cost amplifier. Remote query costs are tracked and budgeted.
- Progressive. Standalone instances pay zero overhead. Federation is opt-in at each level.
- Composable with existing stack. Uses MCP (ADR-0008) for capability relay, Cedar (ADR-0006) for policy, cost envelopes (ADR-0012/0012a) for budgets, PROV-O (ADR-0009) for provenance. No new trust primitives required.
Negative
- Latency. Federated queries add network round-trips. Sub-query latency is visible in cost envelopes but cannot be eliminated. Mitigation: query planners can prefer local data, and remote results can be cached (at the price of TTL and cache-invalidation complexity).
- Partial failure. A query spanning three peers may succeed on two and time out on one. The framework must define failure semantics: fail the whole query, or return partial results with diagnostics. Recommendation: default to fail-closed (consistent), with an opt-in SERVICE SILENT modifier for fail-open (best-effort), matching the SPARQL 1.1 spec.
- Trust complexity. Peer trust is bilateral: each instance must configure which peers it trusts and at what level. This is inherent to cross-org data sharing, not introduced by this ADR, but the framework must surface it clearly in config and docs.
- Debugging difficulty. Distributed query plans are harder to inspect than local ones. Mitigation: trails explain outputs a federated query plan showing which sub-queries route where, with estimated cost per peer.
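The recommended failure semantics can be sketched as follows. `gather_service_results` and `PeerTimeout` are hypothetical names, and the sketch treats each peer's results as an independent list of rows, which simplifies away SPARQL join semantics; it is meant only to contrast fail-closed with the fail-open behavior of SERVICE SILENT.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[dict]


class PeerTimeout(Exception):
    """Stand-in for a SERVICE call that exceeded its timeout."""


def gather_service_results(
    calls: Dict[str, Callable[[], Rows]],
    silent: bool = False,
) -> Tuple[Rows, List[str]]:
    """Run each peer call and return (rows, diagnostics).

    silent=False: fail closed, the first peer failure aborts the whole query.
    silent=True:  fail open, failed peers contribute no rows but are reported.
    """
    rows: Rows = []
    diagnostics: List[str] = []
    for peer, call in calls.items():
        try:
            rows.extend(call())
        except PeerTimeout as exc:
            if not silent:
                raise RuntimeError(f"federated query failed at peer {peer!r}") from exc
            diagnostics.append(f"{peer}: timed out, results omitted")
    return rows, diagnostics


def healthy_peer() -> Rows:
    return [{"drug": "Aspirin", "interaction": "Warfarin"}]


def slow_peer() -> Rows:
    raise PeerTimeout()


peer_calls = {"pharma": healthy_peer, "regulatory": slow_peer}
partial_rows, notes = gather_service_results(peer_calls, silent=True)
```

With `silent=False` (the default), the same call raises instead of returning partial rows, so a caller cannot mistake an incomplete answer for a complete one.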
Non-consequences
- Not a distributed database. There is no Raft, Paxos, or other consensus protocol. Each instance owns its data. Federation is read-only query routing and capability invocation, not replication.
- No automatic replication. Data does not sync between instances. If replication is needed, that is a separate ADR (event-sourced replication or CRDT-based merge).
- No breaking change to standalone behavior. An instance that does not configure federation behaves identically to today.
- Not a replacement for bulk import. Federation is for live, policy-gated, cross-instance queries. Bulk data migration remains a separate concern (trails import / trails export).
Revisit conditions
- If SPARQL federation performance proves inadequate for latency-sensitive use cases, evaluate a binary query protocol (e.g., Apache Arrow Flight for result transfer) alongside SPARQL for query expression.
- If MCP capability relay becomes unwieldy for high-frequency cross-instance calls, evaluate gRPC or a persistent connection protocol as an alternative transport.
- If peer trust management becomes a user-experience bottleneck, evaluate integration with a trust registry or Web-of-Trust discovery (ADR-0015, ADR-0016).
Alternatives considered
- Full distributed database (Raft/consensus). Rejected. Massively increases complexity, operational burden, and failure modes. Trails is a framework for apps, not a distributed systems kernel. Users who need distributed storage can run Oxigraph behind a sharding proxy or use a managed graph database.
- GraphQL federation (Apollo-style). Rejected. GraphQL federation is designed for schema-stitching across microservices, not for ad-hoc graph traversal. SPARQL federation is the native query federation mechanism for RDF stores; using GraphQL would require translating between query paradigms at every boundary.
- Custom peer-to-peer protocol. Rejected. Inventing a new protocol adds specification burden, interop friction, and maintenance cost. SPARQL federation is standardized; MCP is the framework's primary transport. Both are sufficient.
- Event-sourced replication (CRDT / event log). Deferred, not rejected. Replication solves a different problem (data durability and offline-first) than federation (live cross-instance queries). A future ADR can add replication as an orthogonal feature.
- Federation via a central hub/broker. Rejected. A central broker creates a single point of failure, a trust bottleneck, and an operational dependency. Peer-to-peer federation with static or discovered peers is more resilient and simpler to operate.
Dependencies
| ADR | Relationship |
|---|---|
| ADR-0006 (Cedar policy) | Cedar must gate all cross-instance queries and capability invocations. Federation policies are Cedar policies. |
| ADR-0007 (Oxigraph default) | SPARQL endpoint exposes Oxigraph's query evaluator. Timeout and memory bounds from ADR-0007 apply to federated sub-queries served locally. |
| ADR-0008 (MCP primary transport) | MCP capability relay (Layer 2) extends the existing MCP transport to cross-instance invocation. |
| ADR-0009 (Provenance always on) | Cross-instance calls are recorded as prov:Delegation in the provenance graph. |
| ADR-0010 (Biscuit tokens) | Authentication for cross-instance requests uses scoped biscuit tokens. |
| ADR-0011 (DID identity) | Instance identity in the mesh is DID-based. |
| ADR-0012 / 0012a (Cost envelopes) | Cost envelopes span federation boundaries. Remote costs are attributed to the originating envelope. |
| ADR-0021 (Progressive enhancement) | Federation levels follow the additive, no-decision-upfront principle. |