Chapter 7: Federation and Scaling
Learning objectives
After this chapter you will be able to:
- Explain the difference between standalone and federated Trails apps.
- Expose a read-only SPARQL endpoint from your instance.
- Configure peers and run cross-instance SERVICE queries.
- Relay capabilities to remote instances via MCP.
- Set up health monitoring with the instance mesh.
- Deploy Trails with Docker and docker-compose.
- Choose the right store backend for your read/write profile.
Standalone vs. federated
Every Trails app starts standalone: one instance, one KG, one process.
Federation is purely additive -- you opt in by adding a [federation]
section to trails.toml. Nothing changes for apps that do not enable
it.
When federation is enabled, your instance can:
- Expose a read-only SPARQL endpoint that other instances (or any SPARQL client) can query.
- Query remote instances using standard SPARQL SERVICE blocks.
- Invoke remote capabilities via MCP protocol relay.
- Monitor peer health and discover new instances automatically.
Exposing a SPARQL endpoint
Enable federation in trails.toml:
[federation]
enabled = true
read_only = true # always true — no writes through federation
max_query_time_ms = 30000 # 30-second timeout per query
Mount the endpoint on a FastAPI app:
from fastapi import FastAPI
from trails.federation import FederationConfig, FederationEndpoint
from trails.federation_http import mount_federation_routes
app = FastAPI()
config = FederationConfig(enabled=True, read_only=True)
endpoint = FederationEndpoint(store=ctx.kg._store, config=config)
mount_federation_routes(app, endpoint)
This registers three HTTP routes:
| Method | Path | Description |
|---|---|---|
| GET | /sparql?query=... | URL-encoded SPARQL query |
| POST | /sparql | Query in request body (application/sparql-query) |
| GET | /sparql/status | Endpoint status and config |
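As a rough illustration of how a client might address the two query routes, the sketch below only builds the requests and does not send them. The base URL `http://localhost:8000` and the helper names `build_get` and `build_post` are assumptions made for this example; the paths, the `application/sparql-query` content type, and the `X-Trails-Principal` header come from this chapter.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the FastAPI app is served locally on port 8000.
BASE = "http://localhost:8000"


def build_get(query: str) -> Request:
    """GET /sparql?query=... with the query URL-encoded."""
    qs = urlencode({"query": query})
    return Request(f"{BASE}/sparql?{qs}", method="GET")


def build_post(query: str, principal: str) -> Request:
    """POST /sparql with the raw query as the request body."""
    return Request(
        f"{BASE}/sparql",
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "X-Trails-Principal": principal,
        },
        method="POST",
    )


get_req = build_get("ASK { ?s ?p ?o }")
post_req = build_post("SELECT * WHERE { ?s ?p ?o } LIMIT 5", "my-instance")
print(get_req.full_url)
print(post_req.get_method(), post_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a live endpoint.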
The endpoint accepts SELECT, ASK, CONSTRUCT, and DESCRIBE
queries. All write operations (INSERT, DELETE, UPDATE) are
rejected at the lexical level before the query reaches the store.
Every response carries an X-Trails-Trace-Id header for provenance
correlation across instances.
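The lexical write rejection described above can be pictured with a small sketch. This is not the Trails implementation: the function name `is_read_only`, the keyword list, and the regular expressions are assumptions chosen for illustration. The idea is to blank out string literals, IRIs, and comments first, so that a keyword appearing inside data does not trigger a false rejection, and then scan the remaining words.

```python
import re

# Assumption: SPARQL Update keywords that a read-only endpoint would refuse.
WRITE_KEYWORDS = frozenset({"INSERT", "DELETE", "LOAD", "CLEAR", "DROP", "CREATE"})

# Long strings first, then short strings, IRIs, and finally comments,
# so that '#' inside an IRI or a string is not mistaken for a comment.
_NOISE = re.compile(
    r'"""[\s\S]*?"""'
    r"|'''[\s\S]*?'''"
    r'|"(?:[^"\\\n]|\\.)*"'
    r"|'(?:[^'\\\n]|\\.)*'"
    r'|<[^<>"{}|^`\\\s]*>'
    r"|#[^\n]*"
)

# Bare words only: skip variables (?x, $x) and prefixed-name local parts (ex:delete).
_WORD = re.compile(r"(?<![:?$\w])[A-Za-z]+")


def is_read_only(query: str) -> bool:
    """Return True when no update keyword appears outside literals, IRIs, and comments."""
    stripped = _NOISE.sub(" ", query)
    words = {w.upper() for w in _WORD.findall(stripped)}
    return not (words & WRITE_KEYWORDS)


print(is_read_only("SELECT ?s WHERE { ?s ?p ?o }"))
print(is_read_only("INSERT DATA { <a> <b> <c> }"))
```

A filter like this errs on the side of rejecting too much, which is the safer failure mode for a read-only endpoint. A real implementation would more likely rely on a proper SPARQL tokenizer.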
Policy gating
Federated queries pass through Cedar policy evaluation. The requesting
principal is read from the X-Trails-Principal HTTP header:
permit(
    principal,
    action == Action::"sparql_query",
    resource == Trails::Federation::Endpoint::"sparql"
) when {
    principal != "anonymous"
};
When no policies are configured, queries execute without authorization checks. When a policy denies a request, the endpoint returns HTTP 403.
Configuring peers
Declare remote instances in trails.toml:
[federation.peers.pharma]
url = "https://pharma.example/sparql"
mcp_url = "https://pharma.example/mcp" # optional — for capability relay
label = "Pharma Knowledge Graph"
trust = "verified"
timeout_ms = 30000
[federation.peers.regulatory]
url = "https://reg.example/sparql"
label = "Regulatory KG"
trust = "verified"
timeout_ms = 15000
Peer names (pharma, regulatory) are used as identifiers in SERVICE
queries, MCP relay calls, and CLI commands.
SERVICE queries across instances
The FederatedQueryEngine rewrites SPARQL queries that contain
SERVICE blocks. For each SERVICE URL it finds a matching peer, sends
the sub-query over HTTP, and merges the remote bindings with local
results.
from trails.federation import FederatedQueryEngine
peers = {
    "pharma": {"url": "https://pharma.example/sparql", "timeout_ms": 30000},
}
engine = FederatedQueryEngine(peers=peers, store=ctx.kg._store)
result = engine.execute(ctx, """
    PREFIX : <https://myapp.example/>
    SELECT ?drug ?interaction WHERE {
        ?drug :name "Aspirin" .
        SERVICE <https://pharma.example/sparql> {
            ?drug :interactsWith ?interaction .
        }
    }
""")
for binding in result["results"]["bindings"]:
    print(binding["drug"]["value"], binding["interaction"]["value"])
How SERVICE rewriting works
- The engine extracts all SERVICE <url> { ... } blocks.
- Each URL is resolved against configured peers (unknown URLs raise FederationQueryError).
- The SERVICE body is wrapped in SELECT * WHERE { ... } and sent to the remote peer via HTTP POST.
- The local query (SERVICE blocks removed) executes against the local store.
- Local and remote bindings are merged via compatible-variable join.
Remote queries carry the X-Trails-Principal header so the remote
instance can evaluate its own Cedar policies.
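The rewriting steps listed above can be sketched in plain Python. This is an illustrative approximation, not the engine's actual code. The helper names (`split_service_blocks`, `resolve_peer`, `join`) are hypothetical, `FederationQueryError` is redefined locally as a stand-in, and the brace matching ignores braces that appear inside string literals.

```python
import re

SERVICE_RE = re.compile(r"SERVICE\s*<([^>]+)>\s*\{", re.IGNORECASE)


class FederationQueryError(Exception):
    """Local stand-in for the error named in this chapter."""


def split_service_blocks(query):
    """Return (local_query, [(url, remote_query), ...])."""
    remote, kept, pos = [], [], 0
    while (m := SERVICE_RE.search(query, pos)):
        depth, i = 1, m.end()
        while i < len(query) and depth:  # walk to the matching closing brace
            depth += {"{": 1, "}": -1}.get(query[i], 0)
            i += 1
        if depth:
            raise ValueError("unbalanced SERVICE block")
        body = query[m.end(): i - 1].strip()
        remote.append((m.group(1), f"SELECT * WHERE {{ {body} }}"))
        kept.append(query[pos:m.start()])  # keep the text before the block
        pos = i
    kept.append(query[pos:])
    return "".join(kept), remote


def resolve_peer(url, peers):
    """Map a SERVICE URL to a configured peer name."""
    for name, cfg in peers.items():
        if cfg["url"] == url:
            return name
    raise FederationQueryError(f"no configured peer for {url}")


def join(left, right):
    """Compatible-variable join: merge rows that agree on every shared variable."""
    def compatible(a, b):
        return all(a[k] == b[k] for k in a.keys() & b.keys())
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


q = ('SELECT ?drug ?i WHERE { ?drug :name "Aspirin" . '
     "SERVICE <https://pharma.example/sparql> { ?drug :interactsWith ?i . } }")
local_query, remote_queries = split_service_blocks(q)
print(local_query)
print(remote_queries)
```

Rows that share no variables are always compatible, so in that case the join degenerates to a cross product, which matches standard SPARQL join semantics.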
Cost tracking
Every remote SERVICE dispatch records a "federation:remote_query" cost
entry with measured latency.
MCP capability relay
While SERVICE queries read data from remote graphs, MCP relay invokes capabilities on remote instances. This lets you compose workflows across multiple Trails apps.
Configuration
Add mcp_url to any peer that exposes an MCP endpoint:
[federation.peers.warehouse]
url = "https://warehouse.internal:8000/sparql"
mcp_url = "https://warehouse.internal:8000/mcp"
label = "Data Warehouse"
timeout_ms = 30000
The relay() helper
Inside a @capability body, use relay() to invoke a remote tool:
from trails import capability
from trails.federation_mcp import relay
@capability("aggregate_report")
async def aggregate(ctx):
    result = await relay(
        ctx,
        peer="warehouse",
        tool="summarize_sales",
        arguments={"year": 2026},
    )
    return result
relay() resolves the peer from trails.toml, sends a JSON-RPC 2.0
tools/call request, tracks cost as "federation:mcp_relay", and
attaches PROV-O provenance linking local and remote trace IDs.
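For orientation, a JSON-RPC 2.0 `tools/call` request as defined by the MCP specification has roughly the shape built below. The helper names `build_tools_call` and `build_tools_list` and the incrementing request id are assumptions for this sketch; how relay() assembles its request internally is not shown in this chapter.

```python
import json
from itertools import count

_ids = count(1)  # assumption: simple incrementing request ids


def build_tools_call(tool: str, arguments: dict) -> dict:
    """JSON-RPC 2.0 envelope for invoking a remote MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def build_tools_list() -> dict:
    """JSON-RPC 2.0 envelope for listing the tools a peer offers."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": "tools/list"}


payload = build_tools_call("summarize_sales", {"year": 2026})
print(json.dumps(payload, indent=2))
```

The serialized payload would be posted to the peer's mcp_url.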
Discovering remote capabilities
Before invoking, list what a peer offers:
from trails.federation_mcp import MCPRelayClient
client = MCPRelayClient(
    federation_config=my_config,
    principal="my-instance",
)
tools = client.list_tools("warehouse")
for t in tools:
    print(t["name"], t.get("description", ""))
Cedar policy for relay
MCP relay uses action == Action::"mcp_relay" with a resource of
Trails::Federation::MCPRelay::"<peer>/<tool>":
permit(
    principal == User::"clinical-app",
    action == Action::"mcp_relay",
    resource == Trails::Federation::MCPRelay::"warehouse/summarize_sales"
);
Instance mesh: health monitoring and peer discovery
The MeshManager provides continuous health monitoring for all
configured peers. This sits below the query engine and relay client,
enabling health-aware routing and graceful degradation.
Health checks
from trails.federation_mesh import MeshManager
mesh = MeshManager(federation_config=my_config, timeout_s=5.0)
# Check a single peer
health = mesh.health_check("pharma")
print(f"{health.name}: {health.status} ({health.latency_ms}ms)")
# Check all peers
statuses = mesh.health_check_all()
for name, h in statuses.items():
    print(f"{name}: {h.status}")
Health status is one of "healthy", "degraded" (some endpoints up),
or "unreachable". A peer that fails 5 consecutive checks is
soft-removed and excluded from query routing until it recovers.
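The soft-removal rule can be illustrated with a small counter. This is a sketch under assumptions, not MeshManager's internals: the `FailureTracker` class and the `classify` function are hypothetical, and treating "healthy" as "all endpoints up" is inferred from the description of "degraded".

```python
from dataclasses import dataclass, field


def classify(endpoints_up: list) -> str:
    """Map per-endpoint reachability to one of the three status strings."""
    if all(endpoints_up):
        return "healthy"
    return "degraded" if any(endpoints_up) else "unreachable"


@dataclass
class FailureTracker:
    threshold: int = 5  # consecutive failures before soft removal
    failures: dict = field(default_factory=dict)

    def record(self, peer: str, ok: bool) -> None:
        # A successful check resets the streak; a failure extends it.
        self.failures[peer] = 0 if ok else self.failures.get(peer, 0) + 1

    def is_soft_removed(self, peer: str) -> bool:
        return self.failures.get(peer, 0) >= self.threshold


tracker = FailureTracker()
for _ in range(5):
    tracker.record("pharma", ok=False)
print(tracker.is_soft_removed("pharma"))
tracker.record("pharma", ok=True)  # one good check brings the peer back
print(tracker.is_soft_removed("pharma"))
```

Counting consecutive rather than total failures means a single successful check is enough for recovery, which matches the "until it recovers" wording above.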
Peer discovery
discovered = mesh.discover_peers()
for peer in discovered:
    print(f"{peer['name']} ({peer['source']}): {peer['url']}")
DNS-SD queries _trails._tcp.local for automatic discovery on local
networks. Static config is always the primary source.
Background monitoring
Run periodic health checks in a daemon thread:
from trails.federation_mesh import MeshMonitor
monitor = MeshMonitor(mesh_manager=mesh, interval_seconds=60)
monitor.start()
# Your app runs...
print(f"Soft-removed peers: {mesh.soft_removed_peers}")
monitor.stop()
Health-aware query routing
Pass a MeshManager to FederatedQueryEngine for automatic graceful
degradation: SERVICE blocks targeting unreachable peers are skipped with
a warning, so the query returns partial results instead of failing.
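The skip-with-warning behaviour can be pictured as a dispatch-planning step. The sketch below is self-contained and does not use the Trails API: `plan_dispatch` and its arguments are hypothetical, and how FederatedQueryEngine actually receives the MeshManager is not shown here.

```python
import warnings


def plan_dispatch(service_urls, peers, unreachable):
    """Split SERVICE targets into peers to query and peers to skip."""
    by_url = {cfg["url"]: name for name, cfg in peers.items()}
    dispatch, skipped = [], []
    for url in service_urls:
        name = by_url[url]
        if name in unreachable:
            # Degrade gracefully: warn and continue with partial results.
            warnings.warn(f"skipping SERVICE block for unreachable peer {name!r}")
            skipped.append(name)
        else:
            dispatch.append(name)
    return dispatch, skipped


peers = {
    "pharma": {"url": "https://pharma.example/sparql"},
    "regulatory": {"url": "https://reg.example/sparql"},
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    dispatch, skipped = plan_dispatch(
        ["https://pharma.example/sparql", "https://reg.example/sparql"],
        peers,
        unreachable={"regulatory"},
    )
print(dispatch, skipped, len(caught))
```

Callers that need complete answers can inspect the skipped list and decide whether a partial result is acceptable.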
CLI commands
# Check health of all configured peers
trails federation status
# Discover peers (static + DNS-SD)
trails federation discover
# Ping a specific peer
trails federation ping pharma
Deployment
Docker
A minimal Dockerfile for a Trails app:
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
docker-compose: two federated instances
version: "3.9"
services:
  research:
    build: ./research-app
    ports:
      - "8001:8000"
    environment:
      - TRAILS_FEDERATION_ENABLED=true
    volumes:
      - research-data:/app/data
  clinical:
    build: ./clinical-app
    ports:
      - "8002:8000"
    environment:
      - TRAILS_FEDERATION_ENABLED=true
    volumes:
      - clinical-data:/app/data
volumes:
  research-data:
  clinical-data:
The research app's trails.toml:
[project]
name = "research-pubs"
[federation]
enabled = true
max_query_time_ms = 10000
[federation.peers.clinical]
url = "http://clinical:8000/sparql"
label = "Clinical Evidence"
timeout_ms = 10000
The clinical app's trails.toml:
[project]
name = "clinical-evidence"
[federation]
enabled = true
max_query_time_ms = 10000
[federation.peers.research]
url = "http://research:8000/sparql"
label = "Research Publications"
timeout_ms = 10000
Helm (Kubernetes)
For Kubernetes deployments, a Helm chart packages each Trails instance
as a Deployment + Service + ConfigMap (for trails.toml). The key
considerations:
- Each instance is a separate Deployment with its own trails.toml.
- Peer URLs use Kubernetes service DNS (e.g., http://research-pubs.default.svc:8000/sparql).
- Health probes point at /sparql/status.
- Persistent volumes for the KG store and vector DB.
Performance tips
Choosing the right store backend
Trails supports Oxigraph (default, in-process) and Qlever (external, optimized for reads) as KG store backends:
| Backend | Best for | Trade-off |
|---|---|---|
| Oxigraph | Write-heavy workloads, development, small-to-medium graphs | Fast writes, ACID transactions; query speed degrades on large graphs |
| Qlever | Read-heavy production, large graphs (millions of triples) | Exceptional query speed; writes require re-indexing |
Rule of thumb: Start with Oxigraph (zero ops, embedded). Move to Qlever when query latency on production data becomes a bottleneck.
Federation performance
- Set per-peer timeout_ms to avoid slow peers blocking your queries.
- Use MeshManager with health-aware routing to skip unreachable peers.
- Keep SERVICE blocks focused -- send the smallest possible sub-query to the remote peer and do joins locally.
- Monitor with MeshMonitor and check federation:remote_query cost entries for latency trends.
Vector search performance
- Use SqliteVecStore for development and small datasets (<100k chunks).
- Switch to QdrantStore for production scale.
- Use SentenceTransformerEmbedder (local) to avoid API latency and cost.
- Prefer hybrid retrieval (mode="hybrid") to get precision from SPARQL and recall from vectors.
Example: two Trails instances sharing data
This complete example shows two instances -- a research publication graph and a clinical evidence graph -- querying each other.
Instance A: research publications
# research_app.py
from trails import capability, node_type
@node_type("Paper", fields={"title": str, "doi": str, "year": int})
class Paper: ...
@capability("load_papers")
def load_papers(ctx) -> dict:
    ctx.kg.add(Paper(title="Drug X interactions", doi="10.1234/abc", year=2025))
    ctx.kg.add(Paper(title="Drug Y meta-analysis", doi="10.1234/def", year=2026))
    return {"loaded": 2}
Instance B: clinical evidence
# clinical_app.py
from trails.federation import FederatedQueryEngine
peers = {
    "research": {
        "url": "http://research:8000/sparql",
        "timeout_ms": 10000,
    },
}
engine = FederatedQueryEngine(peers=peers, store=ctx.kg._store)
# Query papers from Instance A using SERVICE
result = engine.execute(ctx, """
    SELECT ?title ?doi WHERE {
        SERVICE <http://research:8000/sparql> {
            ?p a <trails://research-pubs/Paper> ;
               <trails://research-pubs/Paper/title> ?title ;
               <trails://research-pubs/Paper/doi> ?doi .
        }
    }
""")
for binding in result["results"]["bindings"]:
    print(binding["title"]["value"], binding["doi"]["value"])
Adding MCP relay
Instance B can also invoke capabilities on Instance A:
from trails import capability
from trails.federation_mcp import relay
@capability("cross_reference")
async def cross_reference(ctx, drug_name: str):
    # Invoke a capability on the research instance
    papers = await relay(
        ctx,
        peer="research",
        tool="search_papers",
        arguments={"query": drug_name},
    )
    # Combine with local clinical data.
    # Note: drug_name is interpolated into the query text unescaped;
    # validate or escape it first if it can come from untrusted input.
    local = ctx.kg.query(f"""
        SELECT ?trial WHERE {{
            ?trial a <trails://clinical/Trial> ;
                   <trails://clinical/Trial/drug> "{drug_name}" .
        }}
    """)
    return {
        "papers": papers,
        "trials": [r["trial"] for r in local],
    }
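Because the local query above splices drug_name into SPARQL text, a value containing a double quote could change the query's structure. One way to guard against that is to escape the value as a SPARQL string literal before interpolating it. The helper `sparql_string` below is a hypothetical sketch, not part of Trails; it applies the backslash escapes that SPARQL string literals accept.

```python
# Characters that must be escaped inside a double-quoted SPARQL string literal.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def sparql_string(value: str) -> str:
    """Return value as a quoted, escaped SPARQL string literal."""
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in value) + '"'


def trial_query(drug_name: str) -> str:
    """Build the local trial query with the drug name safely quoted."""
    return (
        "SELECT ?trial WHERE { "
        "?trial a <trails://clinical/Trial> ; "
        f"<trails://clinical/Trial/drug> {sparql_string(drug_name)} . }}"
    )


print(trial_query("Aspirin"))
print(trial_query('x" . ?s ?p ?o . "'))
```

If the store supports parameterized queries or initial bindings, those would be preferable to string building; whether ctx.kg.query offers that is not covered in this chapter.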
Deep dives
- Federation guide -- full protocol spec, SERVICE rewriting internals, security considerations, cost tracking.
- Policy guide -- Cedar policy syntax for federation access control.
- MCP Integration guide -- the transport layer that capability relay builds on.
- Observability guide -- cost tracking across federated queries.
What's next: You have now covered the full Trails surface -- from
your first @capability (Chapter 1) through the ORM and knowledge
graph (Chapters 2-4), agentic planning (Chapter 5), data integration
(Chapter 6), and federation (Chapter 7). For the detailed API reference,
see the guide index and ADR index. For
real-world patterns, check the examples/ directory in the repository.