Chapter 7 — Federation and Scaling¶

Learning objectives¶

After this chapter you will be able to:

Explain the difference between standalone and federated Trails apps.
Expose a read-only SPARQL endpoint from your instance.
Configure peers and run cross-instance SERVICE queries.
Relay capabilities to remote instances via MCP.
Set up health monitoring with the instance mesh.
Deploy Trails with Docker and docker-compose.
Choose the right store backend for your read/write profile.

Standalone vs. federated¶

Every Trails app starts standalone: one instance, one KG, one process. Federation is purely additive -- you opt in by adding a [federation] section to trails.toml. Nothing changes for apps that do not enable it.

When federation is enabled, your instance can:

Expose a read-only SPARQL endpoint that other instances (or any SPARQL client) can query.
Query remote instances using standard SPARQL SERVICE blocks.
Invoke remote capabilities via MCP protocol relay.
Monitor peer health and discover new instances automatically.

standalone:     [App] ──── [KG]

federated:      [App A] ──── [KG A]
                   │ SERVICE / relay
                [App B] ──── [KG B]

Exposing a SPARQL endpoint¶

Enable federation in trails.toml:

[federation]
enabled = true
read_only = true            # always true — no writes through federation
max_query_time_ms = 30000   # 30-second timeout per query

Mount the endpoint on a FastAPI app:

from fastapi import FastAPI
from trails.federation import FederationConfig, FederationEndpoint
from trails.federation_http import mount_federation_routes

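app = FastAPI()
config = FederationConfig(enabled=True, read_only=True)
# ctx is the Trails context that holds your knowledge graph,
# for example the one passed to a capability.
endpoint = FederationEndpoint(store=ctx.kg._store, config=config)
mount_federation_routes(app, endpoint)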

This registers three HTTP routes:

Method	Path	Description
`GET`	`/sparql?query=...`	URL-encoded SPARQL query
`POST`	`/sparql`	Query in request body (`application/sparql-query`)
`GET`	`/sparql/status`	Endpoint status and config

The endpoint accepts SELECT, ASK, CONSTRUCT, and DESCRIBE queries. Write operations (INSERT, DELETE, and other SPARQL Update forms) are rejected at the lexical level, before the query reaches the store.
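To make "rejected at the lexical level" concrete, here is a minimal sketch of a token-based check. It is not the Trails implementation: the function name, the keyword list (taken from SPARQL 1.1 Update), and the tokenizer are assumptions for illustration, and the tokenizer does not handle triple-quoted string literals.

```python
import re

# Query forms the chapter lists as accepted.
READ_FORMS = {"SELECT", "ASK", "CONSTRUCT", "DESCRIBE"}
# SPARQL 1.1 Update keywords; any of these as a bare word means a write.
UPDATE_KEYWORDS = {
    "INSERT", "DELETE", "LOAD", "CLEAR", "CREATE",
    "DROP", "COPY", "MOVE", "ADD",
}

# Alternatives are tried in order, so comments, IRIs, strings, variables,
# and prefixed names are consumed before a bare word can match inside them.
TOKEN_RE = re.compile(
    r"#[^\n]*"                    # comment
    r"|<[^<>\s]*>"                # IRI such as <http://example.org/ns#DELETE>
    r'|"(?:\\.|[^"\\])*"'         # double-quoted string
    r"|'(?:\\.|[^'\\])*'"         # single-quoted string
    r"|[?$]\w+"                   # variable such as ?delete
    r"|[\w-]*:[\w.-]*"            # prefixed name such as ex:DELETE
    r"|(?P<word>[A-Za-z_]\w*)"    # bare keyword
)


def is_read_only(query: str) -> bool:
    """Return True if the query uses a read form and no update keyword."""
    words = [
        m.group("word").upper()
        for m in TOKEN_RE.finditer(query)
        if m.group("word")
    ]
    if any(w in UPDATE_KEYWORDS for w in words):
        return False
    return any(w in READ_FORMS for w in words)


print(is_read_only("SELECT ?s WHERE { ?s ?p ?o }"))
print(is_read_only("DELETE WHERE { ?s ?p ?o }"))
```

Because the check looks only at tokens, it runs before any parsing or store access, which is the property the paragraph above describes.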

Every response carries an X-Trails-Trace-Id header for provenance correlation across instances.

Policy gating¶

Federated queries pass through Cedar policy evaluation. The requesting principal is read from the X-Trails-Principal HTTP header:

permit(
  principal,
  action == Action::"sparql_query",
  resource == Trails::Federation::Endpoint::"sparql"
) when {
  // Compare against an entity; a bare string never equals a principal.
  principal != User::"anonymous"
};

When no policies are configured, queries execute without authorization checks. When a policy denies a request, the endpoint returns HTTP 403.
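From the client side, a query against this endpoint might be built as in the sketch below. It uses only the standard library. The function names are hypothetical, the principal value is a placeholder, and the sketch assumes the endpoint behaves as described above (POST body with application/sparql-query, principal header, HTTP 403 on denial).

```python
import urllib.error
import urllib.request


def build_sparql_request(endpoint_url: str, query: str, principal: str):
    """Build a POST request that carries the query and the principal header."""
    return urllib.request.Request(
        endpoint_url,
        data=query.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/sparql-query",
            "X-Trails-Principal": principal,
        },
    )


def run_query(request, timeout_s: float = 30.0):
    """Send the request and return (status, trace_id, body).

    A policy denial surfaces as status 403 rather than as an exception.
    """
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as resp:
            return resp.status, resp.headers.get("X-Trails-Trace-Id"), resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("X-Trails-Trace-Id"), err.read()


# Building the request does not touch the network.
request = build_sparql_request(
    "https://pharma.example/sparql",
    "ASK { ?s ?p ?o }",
    principal="clinical-app",
)
print(request.get_method(), request.full_url)
```

Logging the returned trace ID alongside the status makes it easier to correlate a 403 with the policy decision on the remote instance.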

Configuring peers¶

Declare remote instances in trails.toml:

[federation.peers.pharma]
url = "https://pharma.example/sparql"
mcp_url = "https://pharma.example/mcp"    # optional — for capability relay
label = "Pharma Knowledge Graph"
trust = "verified"
timeout_ms = 30000

[federation.peers.regulatory]
url = "https://reg.example/sparql"
label = "Regulatory KG"
trust = "verified"
timeout_ms = 15000

Peer names (pharma, regulatory) are used as identifiers in SERVICE queries, MCP relay calls, and CLI commands.

SERVICE queries across instances¶

The FederatedQueryEngine rewrites SPARQL queries that contain SERVICE blocks. For each SERVICE URL it finds a matching peer, sends the sub-query over HTTP, and merges the remote bindings with local results.

from trails.federation import FederatedQueryEngine

peers = {
    "pharma": {"url": "https://pharma.example/sparql", "timeout_ms": 30000},
}
engine = FederatedQueryEngine(peers=peers, store=ctx.kg._store)

result = engine.execute(ctx, """
    PREFIX : <https://myapp.example/>

    SELECT ?drug ?interaction WHERE {
      ?drug :name "Aspirin" .
      SERVICE <https://pharma.example/sparql> {
        ?drug :interactsWith ?interaction .
      }
    }
""")

for binding in result["results"]["bindings"]:
    print(binding["drug"]["value"], binding["interaction"]["value"])

How SERVICE rewriting works¶

The engine extracts all SERVICE <url> { ... } blocks.
Each URL is resolved against configured peers (unknown URLs raise FederationQueryError).
The SERVICE body is wrapped in SELECT * WHERE { ... } and sent to the remote peer via HTTP POST.
The local query (SERVICE blocks removed) executes against the local store.
Local and remote bindings are merged via compatible-variable join.

Remote queries carry the X-Trails-Principal header so the remote instance can evaluate its own Cedar policies.
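The extraction and join steps can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the FederatedQueryEngine source: the function names are hypothetical, braces inside string literals are not handled, and PREFIX declarations are not copied into the sub-query.

```python
import re

SERVICE_RE = re.compile(r"SERVICE\s*<([^>]+)>\s*\{", re.IGNORECASE)


def split_service_blocks(query: str):
    """Return (local_query, [(url, sub_query), ...])."""
    remote, local_parts, pos = [], [], 0
    while True:
        match = SERVICE_RE.search(query, pos)
        if match is None:
            local_parts.append(query[pos:])
            break
        local_parts.append(query[pos:match.start()])
        # Walk forward to the brace that closes this SERVICE block.
        depth, i = 1, match.end()
        while i < len(query) and depth > 0:
            if query[i] == "{":
                depth += 1
            elif query[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            raise ValueError("unbalanced braces in SERVICE block")
        body = query[match.end():i - 1].strip()
        remote.append((match.group(1), f"SELECT * WHERE {{ {body} }}"))
        pos = i
    return "".join(local_parts), remote


def join_bindings(left: list, right: list) -> list:
    """Join two binding lists on the variables they share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[var] == r_row[var] for var in shared):
                joined.append({**l_row, **r_row})
    return joined


query = """
SELECT ?drug ?interaction WHERE {
  ?drug :name "Aspirin" .
  SERVICE <https://pharma.example/sparql> {
    ?drug :interactsWith ?interaction .
  }
}
"""
local_query, remote_queries = split_service_blocks(query)
print(local_query)
print(remote_queries)
```

Two rows are compatible when every variable they share is bound to the same term. Rows with no shared variables always join, which yields a cross product, so it pays to keep a join variable such as ?drug on both sides.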

Cost tracking¶

Every remote SERVICE dispatch records a "federation:remote_query" cost entry with measured latency:

engine = FederatedQueryEngine(
    peers=peers,
    store=ctx.kg._store,
    cost_tracker=my_tracker,
)
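The chapter does not show the interface that cost_tracker must implement, so the class below is hypothetical. It only illustrates the idea of recording one latency entry per remote dispatch under a named cost kind and summarizing the entries afterwards.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class InMemoryCostTracker:
    """Hypothetical tracker: stores latencies in milliseconds per cost kind."""

    def __init__(self):
        self.entries = defaultdict(list)

    def record(self, kind: str, latency_ms: float) -> None:
        self.entries[kind].append(latency_ms)

    @contextmanager
    def timed(self, kind: str):
        """Measure the wrapped block and record it, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(kind, (time.perf_counter() - start) * 1000.0)

    def summary(self, kind: str):
        values = self.entries.get(kind, [])
        if not values:
            return None
        return {
            "count": len(values),
            "mean_ms": sum(values) / len(values),
            "max_ms": max(values),
        }


tracker = InMemoryCostTracker()
with tracker.timed("federation:remote_query"):
    pass  # the remote SERVICE dispatch would happen here
print(tracker.summary("federation:remote_query"))
```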

MCP capability relay¶

While SERVICE queries read data from remote graphs, MCP relay invokes capabilities on remote instances. This lets you compose workflows across multiple Trails apps.

Configuration¶

Add mcp_url to any peer that exposes an MCP endpoint:

[federation.peers.warehouse]
url = "https://warehouse.internal:8000/sparql"
mcp_url = "https://warehouse.internal:8000/mcp"
label = "Data Warehouse"
timeout_ms = 30000

The `relay()` helper¶

Inside a @capability body, use relay() to invoke a remote tool:

from trails import capability
from trails.federation_mcp import relay

@capability("aggregate_report")
async def aggregate(ctx):
    result = await relay(
        ctx,
        peer="warehouse",
        tool="summarize_sales",
        arguments={"year": 2026},
    )
    return result

relay() resolves the peer from trails.toml, sends a JSON-RPC 2.0 tools/call request, tracks cost as "federation:mcp_relay", and attaches PROV-O provenance linking local and remote trace IDs.
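For reference, a JSON-RPC 2.0 tools/call message has the shape sketched below. The envelope and the params layout follow the MCP specification; the helper names are hypothetical, and how relay() assigns request IDs or adds headers is not shown in this chapter.

```python
import itertools
import json

_request_ids = itertools.count(1)


def build_tools_call(tool: str, arguments: dict) -> dict:
    """Build the JSON-RPC 2.0 envelope for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def unwrap_result(response: dict):
    """Return the result member, or raise if the peer reported an error."""
    if "error" in response:
        err = response["error"]
        raise RuntimeError(f"remote error {err.get('code')}: {err.get('message')}")
    return response["result"]


payload = build_tools_call("summarize_sales", {"year": 2026})
print(json.dumps(payload, indent=2))
```

A JSON-RPC response carries either a result or an error member, never both, which is why unwrap_result checks for error first.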

Discovering remote capabilities¶

Before invoking, list what a peer offers:

from trails.federation_mcp import MCPRelayClient

client = MCPRelayClient(
    federation_config=my_config,
    principal="my-instance",
)

tools = client.list_tools("warehouse")
for t in tools:
    print(t["name"], t.get("description", ""))
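One way to use the listing is as a guard before a relay call, so that a missing tool fails early with a clear message. The sketch below works on a list shaped like the one printed above; the helper name and the sample entries are hypothetical.

```python
def require_tool(tools: list, name: str) -> dict:
    """Return the tool entry with the given name or raise LookupError."""
    for tool in tools:
        if tool["name"] == name:
            return tool
    available = ", ".join(sorted(t["name"] for t in tools)) or "none"
    raise LookupError(f"tool {name!r} not offered by peer (available: {available})")


# Sample listing; a real one would come from the peer.
sample_tools = [
    {"name": "summarize_sales", "description": "Summarize sales for a year"},
    {"name": "list_regions"},
]

tool = require_tool(sample_tools, "summarize_sales")
print(tool["name"], tool.get("description", ""))
```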

Cedar policy for relay¶

MCP relay uses action == Action::"mcp_relay" with a resource of Trails::Federation::MCPRelay::"<peer>/<tool>":

permit(
  principal == User::"clinical-app",
  action == Action::"mcp_relay",
  resource == Trails::Federation::MCPRelay::"warehouse/summarize_sales"
);

Instance mesh: health monitoring and peer discovery¶

The MeshManager provides continuous health monitoring for all configured peers. This sits below the query engine and relay client, enabling health-aware routing and graceful degradation.

Health checks¶

from trails.federation_mesh import MeshManager

mesh = MeshManager(federation_config=my_config, timeout_s=5.0)

# Check a single peer
health = mesh.health_check("pharma")
print(f"{health.name}: {health.status} ({health.latency_ms}ms)")

# Check all peers
statuses = mesh.health_check_all()
for name, h in statuses.items():
    print(f"{name}: {h.status}")

Health status is one of "healthy", "degraded" (some endpoints up), or "unreachable". A peer that fails 5 consecutive checks is soft-removed and excluded from query routing until it recovers.
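The soft-removal rule amounts to a per-peer counter of consecutive failures. The sketch below is not the MeshManager implementation; the class name is hypothetical, and it assumes that only "unreachable" counts as a failure while "healthy" and "degraded" both reset the counter.

```python
class PeerFailureTracker:
    """Track consecutive failed checks and expose soft-removed peers."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self._failures = {}

    def record(self, peer: str, status: str) -> None:
        # Assumption: "degraded" still counts as a passed check.
        if status == "unreachable":
            self._failures[peer] = self._failures.get(peer, 0) + 1
        else:
            self._failures[peer] = 0

    @property
    def soft_removed(self) -> list:
        return sorted(
            peer for peer, count in self._failures.items()
            if count >= self.threshold
        )


tracker = PeerFailureTracker()
for _ in range(5):
    tracker.record("pharma", "unreachable")
print(tracker.soft_removed)

tracker.record("pharma", "healthy")  # one passing check restores the peer
print(tracker.soft_removed)
```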

Peer discovery¶

discovered = mesh.discover_peers()
for peer in discovered:
    print(f"{peer['name']} ({peer['source']}): {peer['url']}")

DNS-SD queries _trails._tcp.local for automatic discovery on local networks. Static config is always the primary source.

Background monitoring¶

Run periodic health checks in a daemon thread:

from trails.federation_mesh import MeshMonitor

monitor = MeshMonitor(mesh_manager=mesh, interval_seconds=60)
monitor.start()

# Your app runs...

print(f"Soft-removed peers: {mesh.soft_removed_peers}")
monitor.stop()
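The general pattern behind a monitor like this is a daemon thread that waits on an event between runs. The sketch below shows that pattern with the standard library only; PeriodicChecker is a hypothetical name, not a Trails class, and MeshMonitor may be implemented differently.

```python
import threading


class PeriodicChecker:
    """Run a callable at a fixed interval on a daemon thread."""

    def __init__(self, check, interval_seconds: float):
        self._check = check
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self._check()
            except Exception as exc:
                # A failing check must not kill the monitoring thread.
                print(f"health check failed: {exc}")
            # wait() returns early when stop() sets the event.
            self._stop.wait(self._interval)

    def start(self) -> None:
        self._thread.start()

    def stop(self, join_timeout: float = 5.0) -> None:
        self._stop.set()
        self._thread.join(join_timeout)

    def is_alive(self) -> bool:
        return self._thread.is_alive()
```

Waiting on the event instead of calling time.sleep lets stop() take effect immediately rather than after the current interval expires.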

Health-aware query routing¶

Pass a MeshManager to FederatedQueryEngine for automatic graceful degradation:

engine = FederatedQueryEngine(
    peers=peers,
    store=ctx.kg._store,
    mesh_manager=mesh,
)

SERVICE blocks that target unreachable peers are skipped with a warning, so the query returns partial results instead of failing.
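The routing decision can be pictured as a filter over the SERVICE URLs found in a query. The sketch below is an assumption about the shape of that logic, not the engine's code: plan_dispatch and its arguments are hypothetical, and LookupError stands in for FederationQueryError.

```python
import logging

logger = logging.getLogger("federation.routing")


def plan_dispatch(service_urls: list, url_to_peer: dict, soft_removed: set):
    """Split SERVICE URLs into peers to query and peers to skip."""
    dispatch, skipped = [], []
    for url in service_urls:
        peer = url_to_peer.get(url)
        if peer is None:
            # Unknown URLs are an error, not a skip.
            raise LookupError(f"no configured peer for {url}")
        if peer in soft_removed:
            logger.warning("skipping SERVICE block for unreachable peer %s", peer)
            skipped.append(peer)
        else:
            dispatch.append(peer)
    return dispatch, skipped


dispatch, skipped = plan_dispatch(
    ["https://pharma.example/sparql", "https://reg.example/sparql"],
    {
        "https://pharma.example/sparql": "pharma",
        "https://reg.example/sparql": "regulatory",
    },
    soft_removed={"regulatory"},
)
print("query:", dispatch, "skipped:", skipped)
```

Returning the skipped peers alongside the plan lets the caller tell users that a result set is partial.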

CLI commands¶

# Check health of all configured peers
trails federation status

# Discover peers (static + DNS-SD)
trails federation discover

# Ping a specific peer
trails federation ping pharma

Deployment¶

Docker¶

A minimal Dockerfile for a Trails app:

FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

docker-compose: two federated instances¶

version: "3.9"

services:
  research:
    build: ./research-app
    ports:
      - "8001:8000"
    environment:
      - TRAILS_FEDERATION_ENABLED=true
    volumes:
      - research-data:/app/data

  clinical:
    build: ./clinical-app
    ports:
      - "8002:8000"
    environment:
      - TRAILS_FEDERATION_ENABLED=true
    volumes:
      - clinical-data:/app/data

volumes:
  research-data:
  clinical-data:

The research app's trails.toml:

[project]
name = "research-pubs"

[federation]
enabled = true
max_query_time_ms = 10000

[federation.peers.clinical]
url = "http://clinical:8000/sparql"
label = "Clinical Evidence"
timeout_ms = 10000

The clinical app's trails.toml:

[project]
name = "clinical-evidence"

[federation]
enabled = true
max_query_time_ms = 10000

[federation.peers.research]
url = "http://research:8000/sparql"
label = "Research Publications"
timeout_ms = 10000

Helm (Kubernetes)¶

For Kubernetes deployments, a Helm chart packages each Trails instance as a Deployment + Service + ConfigMap (for trails.toml). The key considerations:

Each instance is a separate Deployment with its own trails.toml.
Peer URLs use Kubernetes service DNS (e.g., http://research-pubs.default.svc:8000/sparql).
Health probes point at /sparql/status.
The KG store and vector DB need persistent volumes.
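The same status route can be polled from a script, for instance to wait for a peer before running a smoke test. The sketch below assumes only that /sparql/status answers with HTTP 200 when the instance is up; the function names are hypothetical and the response body is not inspected.

```python
import time
import urllib.error
import urllib.request


def status_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts: int = 10, delay_s: float = 1.0,
                     sleep=time.sleep) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Example (requires a running instance):
# ready = wait_until_ready(
#     lambda: status_ok("http://research-pubs.default.svc:8000/sparql/status")
# )
```

Passing the probe and the sleep function as arguments keeps the retry loop testable without a network.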

Performance tips¶

Choosing the right store backend¶

Trails supports Oxigraph (default, in-process) and QLever (external, optimized for reads) as KG store backends:

Backend	Best for	Trade-off
Oxigraph	Write-heavy workloads, development, small-to-medium graphs	Fast writes, ACID transactions; query speed degrades on large graphs
QLever	Read-heavy production, large graphs (millions of triples)	Exceptional query speed; writes require re-indexing

Rule of thumb: Start with Oxigraph (zero ops, embedded). Move to QLever when query latency on production data becomes a bottleneck.

Federation performance¶

Set per-peer timeout_ms to avoid slow peers blocking your queries.
Use MeshManager with health-aware routing to skip unreachable peers.
Keep SERVICE blocks focused -- send the smallest possible sub-query to the remote peer and do joins locally.
Monitor with MeshMonitor and check federation:remote_query cost entries for latency trends.

Vector search performance¶

Use SqliteVecStore for development and small datasets (<100k chunks).
Switch to QdrantStore for production scale.
Use SentenceTransformerEmbedder (local) to avoid API latency and cost.
Prefer hybrid retrieval (mode="hybrid") to get precision from SPARQL and recall from vectors.

The complete example below shows two instances, a research publication graph and a clinical evidence graph, with the clinical instance querying the research instance.

Instance A: research publications¶

# research_app.py
from trails import capability, node_type

@node_type("Paper", fields={"title": str, "doi": str, "year": int})
class Paper: ...

@capability("load_papers")
def load_papers(ctx) -> dict:
    ctx.kg.add(Paper(title="Drug X interactions", doi="10.1234/abc", year=2025))
    ctx.kg.add(Paper(title="Drug Y meta-analysis", doi="10.1234/def", year=2026))
    return {"loaded": 2}

Instance B: clinical evidence¶

# clinical_app.py
from trails.federation import FederatedQueryEngine

peers = {
    "research": {
        "url": "http://research:8000/sparql",
        "timeout_ms": 10000,
    },
}
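# ctx is the Trails context of the clinical instance,
# for example the one passed to a capability.
engine = FederatedQueryEngine(peers=peers, store=ctx.kg._store)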

# Query papers from Instance A using SERVICE
result = engine.execute(ctx, """
    SELECT ?title ?doi WHERE {
      SERVICE <http://research:8000/sparql> {
        ?p a <trails://research-pubs/Paper> ;
           <trails://research-pubs/Paper/title> ?title ;
           <trails://research-pubs/Paper/doi> ?doi .
      }
    }
""")

for binding in result["results"]["bindings"]:
    print(binding["title"]["value"], binding["doi"]["value"])

Adding MCP relay¶

Instance B can also invoke capabilities on Instance A:

from trails import capability
from trails.federation_mcp import relay

@capability("cross_reference")
async def cross_reference(ctx, drug_name: str):
    # Invoke a capability on the research instance
    papers = await relay(
        ctx,
        peer="research",
        tool="search_papers",
        arguments={"query": drug_name},
    )

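    # Escape backslashes, quotes, and newlines so drug_name
    # cannot break out of the SPARQL string literal.
    safe_name = (
        drug_name.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )

    # Combine with local clinical data
    local = ctx.kg.query(f"""
        SELECT ?trial WHERE {{
            ?trial a <trails://clinical/Trial> ;
                   <trails://clinical/Trial/drug> "{safe_name}" .
        }}
    """)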

    return {
        "papers": papers,
        "trials": [r["trial"] for r in local],
    }

Deep dives¶

Federation guide -- full protocol spec, SERVICE rewriting internals, security considerations, cost tracking.
Policy guide -- Cedar policy syntax for federation access control.
MCP Integration guide -- the transport layer that capability relay builds on.
Observability guide -- cost tracking across federated queries.

What's next: You have now covered the full Trails surface -- from your first @capability (Chapter 1) through the ORM and knowledge graph (Chapters 2-4), agentic planning (Chapter 5), data integration (Chapter 6), and federation (Chapter 7). For the detailed API reference, see the guide index and ADR index. For real-world patterns, check the examples/ directory in the repository.