ADR-0043: State Machine Pattern for Federation Peers and Capability Invocations
- Status: Accepted
- Date: 2026-04-18
- Relates to: ADR-0023 (SPARQL Federation Instance Mesh)
Context
Federation peer tracking currently uses simple boolean/string status
values ("healthy", "degraded", "unreachable") with no formal
lifecycle. A peer jumps directly between states without guards,
transition validation, or history. There is no way to answer "when did
this peer become degraded?" or "how many times has it recovered?"
Capability invocations have a similar gap: invoke() runs through
validation, policy, execution, and provenance steps, but there is no
explicit lifecycle model. If a step fails, the only evidence is the
exception — there is no queryable state trail showing where in the
pipeline the invocation stopped.
Both problems point at the same missing primitive: a lightweight finite state machine (FSM) with transition guards, side-effect hooks, and history tracking.
Decision
Introduce a generic, reusable StateMachine class in trails.fsm
(new module, zero external dependencies) and apply it to two domains:
1. Generic FSM (trails.fsm)
- Transition(from_state, to_state, guard?, on_enter?): a single allowed edge in the state graph.
- StateMachine(name, initial, transitions): holds current state, validates transitions, fires guards/callbacks, records history.
- transition(to, **context) -> bool: attempt a state change; returns False if no valid transition exists or a guard rejects.
- can_transition(to) -> bool: check without side effects.
- history() -> list[(from, to, timestamp)]: full audit trail.
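The interface listed above could be sketched roughly as follows. This is a minimal illustration, not the trails.fsm implementation: it assumes string state names, guards and hooks that receive the transition context as a dict, and an optional **context on can_transition so guards can be evaluated during the check (the ADR lists can_transition(to) without it). The "door" machine at the end is a made-up example.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Assumed callback shapes: both receive the transition context as a dict.
Guard = Callable[[dict[str, Any]], bool]
Hook = Callable[[dict[str, Any]], None]


@dataclass(frozen=True)
class Transition:
    """A single allowed edge in the state graph."""
    from_state: str
    to_state: str
    guard: Optional[Guard] = None     # returns False to veto the move
    on_enter: Optional[Hook] = None   # side effect fired after the move


class StateMachine:
    def __init__(self, name: str, initial: str, transitions: list[Transition]):
        self.name = name
        self.state = initial
        # Index edges by (from, to) so lookups are a single dict access.
        self._edges = {(t.from_state, t.to_state): t for t in transitions}
        self._history: list[tuple[str, str, float]] = []

    def can_transition(self, to: str, **context: Any) -> bool:
        """Check an edge without changing state or firing on_enter."""
        edge = self._edges.get((self.state, to))
        if edge is None:
            return False
        return edge.guard is None or bool(edge.guard(context))

    def transition(self, to: str, **context: Any) -> bool:
        """Attempt a state change; False if no edge exists or the guard rejects."""
        if not self.can_transition(to, **context):
            return False
        edge = self._edges[(self.state, to)]
        self._history.append((self.state, to, time.time()))
        self.state = to
        if edge.on_enter is not None:
            edge.on_enter(context)
        return True

    def history(self) -> list[tuple[str, str, float]]:
        """Return a copy of the audit trail as (from, to, timestamp) tuples."""
        return list(self._history)


# Hypothetical usage: a door that only opens when the context says "unlocked".
fsm = StateMachine("door", "closed", [
    Transition("closed", "open", guard=lambda ctx: ctx.get("unlocked", False)),
    Transition("open", "closed"),
])
first = fsm.transition("open")                  # guard rejects, state unchanged
second = fsm.transition("open", unlocked=True)  # guard passes, state is "open"
```

Returning False instead of raising keeps callers free to treat a rejected transition as a normal outcome, which matches the bool return type in the list above.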
2. Peer FSM (federation_mesh)
DISCOVERED → PROBING → HEALTHY → DEGRADED → UNREACHABLE → REMOVED
Back edges:
- DEGRADED → HEALTHY (recovered)
- UNREACHABLE → PROBING (retry)
States:
- DISCOVERED: peer is known from config or DNS-SD but never probed.
- PROBING: health check in progress.
- HEALTHY: all endpoints responding.
- DEGRADED: partial response (e.g., SPARQL ok but MCP down).
- UNREACHABLE: all probes failed.
- REMOVED: soft-removed after repeated failures.
This replaces the string status field on PeerHealth with one FSM
instance per peer, stored in MeshManager._peer_fsm.
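As an illustration of the peer lifecycle, the sketch below encodes only the edges that can be read from the forward chain and the two labelled back edges; the real graph may allow more. PeerFsm is a hypothetical stand-in for a StateMachine configured with these edges, the lowercase values for the three new states are assumed, and the plain dict mirrors the role of MeshManager._peer_fsm.

```python
import time
from enum import Enum


class PeerState(Enum):
    DISCOVERED = "discovered"
    PROBING = "probing"
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNREACHABLE = "unreachable"
    REMOVED = "removed"


# Assumption: only the forward chain plus the two labelled back edges.
PEER_EDGES = {
    PeerState.DISCOVERED: {PeerState.PROBING},
    PeerState.PROBING: {PeerState.HEALTHY},
    PeerState.HEALTHY: {PeerState.DEGRADED},
    PeerState.DEGRADED: {PeerState.HEALTHY, PeerState.UNREACHABLE},
    PeerState.UNREACHABLE: {PeerState.PROBING, PeerState.REMOVED},
    PeerState.REMOVED: set(),
}


class PeerFsm:
    """Hypothetical stand-in for a StateMachine built from PEER_EDGES."""

    def __init__(self) -> None:
        self.state = PeerState.DISCOVERED
        self.history: list[tuple[PeerState, PeerState, float]] = []

    def transition(self, to: PeerState) -> bool:
        if to not in PEER_EDGES[self.state]:
            return False  # edge not in the graph, state unchanged
        self.history.append((self.state, to, time.time()))
        self.state = to
        return True


def count_recoveries(fsm: PeerFsm) -> int:
    """Answer "how many times has it recovered?" from the history."""
    recovered = (PeerState.DEGRADED, PeerState.HEALTHY)
    return sum(1 for src, dst, _ in fsm.history if (src, dst) == recovered)


# One FSM per peer, keyed by a made-up peer id.
peer_fsm = {"peer-a": PeerFsm()}
fsm = peer_fsm["peer-a"]
for nxt in (PeerState.PROBING, PeerState.HEALTHY,
            PeerState.DEGRADED, PeerState.HEALTHY):
    fsm.transition(nxt)

skipped = fsm.transition(PeerState.REMOVED)  # HEALTHY → REMOVED is not an edge
recoveries = count_recoveries(fsm)
```

The timestamp on each history entry is what would answer "when did this peer become degraded?" from the Context section.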
3. Invocation FSM (runtime)
States:
- RECEIVED — invoke() entered, args parsed.
- VALIDATING — checking required params and shape constraints.
- AUTHORIZED — policy evaluation passed (or no policy configured).
- EXECUTING — handler running.
- COMPLETED — handler returned successfully, provenance attached.
- REJECTED — validation failed (missing args, shape violation).
- DENIED — policy evaluation returned DENY.
- FAILED — handler raised an exception.
The FSM is created per invoke() call, its terminal state is recorded
in the response envelope (or exception metadata), and its history is
available for observability consumers.
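A rough sketch of how a per-call state trail might look is given below. The ADR names the states but not the edges, so the edge table here is an assumption (in particular, that both REJECTED and DENIED are reached from VALIDATING). The invoke signature, the policy callable, and the envelope keys are likewise hypothetical, and the sketch records the failure in the envelope rather than in exception metadata.

```python
from typing import Any, Callable, Optional

# Assumed edges; the terminal states have no outgoing edges.
EDGES = {
    "RECEIVED": {"VALIDATING"},
    "VALIDATING": {"AUTHORIZED", "REJECTED", "DENIED"},
    "AUTHORIZED": {"EXECUTING"},
    "EXECUTING": {"COMPLETED", "FAILED"},
}


def invoke(
    handler: Callable[..., Any],
    args: dict[str, Any],
    required: tuple[str, ...] = (),
    policy: Optional[Callable[[dict[str, Any]], bool]] = None,
) -> dict[str, Any]:
    trail = ["RECEIVED"]  # a fresh trail per call

    def move(to: str) -> None:
        if to not in EDGES.get(trail[-1], set()):
            raise RuntimeError(f"illegal transition {trail[-1]} -> {to}")
        trail.append(to)

    result, error = None, None
    move("VALIDATING")
    missing = [name for name in required if name not in args]
    if missing:
        move("REJECTED")
        error = f"missing args: {missing}"
    elif policy is not None and not policy(args):
        move("DENIED")
        error = "policy denied"
    else:
        move("AUTHORIZED")
        move("EXECUTING")
        try:
            result = handler(**args)
            move("COMPLETED")
        except Exception as exc:  # the trail shows where the pipeline stopped
            move("FAILED")
            error = repr(exc)
    return {"state": trail[-1], "trail": trail, "result": result, "error": error}


ok = invoke(lambda x: x * 2, {"x": 21}, required=("x",))
rejected = invoke(lambda x: x, {}, required=("x",))
denied = invoke(lambda x: x, {"x": 1}, policy=lambda a: False)
failed = invoke(lambda x: 1 / x, {"x": 0})
```

With this shape, a consumer can tell a validation failure from a handler failure by inspecting the trail alone, without parsing the exception.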
Consequences
- Positive: peer lifecycle is explicit, auditable, and extensible (add new states like QUARANTINED without touching boolean logic).
- Positive: invocation lifecycle gives observability hooks a structured state trail instead of ad-hoc event names.
- Positive: generic FSM is reusable for future lifecycle needs (agent sessions, schema migrations, etc.).
- Negative: slight overhead per invocation (one dict + list allocation). Negligible compared to handler execution time.
- Risk: existing code that reads PeerHealth.status as a raw string must be updated. Mitigated by keeping the string values identical to the old ones ("healthy", "degraded", "unreachable").
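One way the mitigation could look in practice is a string-valued enum plus a read-only status property, sketched below. The PeerHealth class here is a simplified stand-in rather than the real one, and the property is an assumed compatibility shim.

```python
from enum import Enum


class PeerState(str, Enum):
    # Values match the legacy status strings exactly.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNREACHABLE = "unreachable"


class PeerHealth:
    """Simplified stand-in; the real class would hold an FSM per peer."""

    def __init__(self, state: PeerState) -> None:
        self._state = state

    @property
    def status(self) -> str:
        # Use .value explicitly rather than relying on str() of the member.
        return self._state.value


peer = PeerHealth(PeerState.DEGRADED)

# Legacy-style comparison against a raw string keeps working.
legacy_check = peer.status == "degraded"

# Because of the str mixin, members also compare equal to plain strings.
mixin_check = PeerState.HEALTHY == "healthy"
```

States with no legacy equivalent (DISCOVERED, PROBING, REMOVED) would still surface as new strings, so callers that branch on the three old values may need a default case.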