ADR-0043: State Machine Pattern for Federation Peers and Capability Invocations

Status: Accepted
Date: 2026-04-18
Relates to: ADR-0023 (SPARQL Federation Instance Mesh)

Context

Federation peer tracking currently uses simple boolean/string status values ("healthy", "degraded", "unreachable") with no formal lifecycle. A peer jumps directly between states without guards, transition validation, or history. There is no way to answer "when did this peer become degraded?" or "how many times has it recovered?"

Capability invocations have a similar gap: invoke() runs through validation, policy, execution, and provenance steps, but there is no explicit lifecycle model. If a step fails, the only evidence is the exception — there is no queryable state trail showing where in the pipeline the invocation stopped.

Both problems point at the same missing primitive: a lightweight finite state machine (FSM) with transition guards, side-effect hooks, and history tracking.

Decision

Introduce a generic, reusable StateMachine class in trails.fsm (a new module with zero external dependencies) and apply it to two domains, federation peers and capability invocations:

1. Generic FSM (`trails.fsm`)

Transition(from_state, to_state, guard?, on_enter?) — a single allowed edge in the state graph.
StateMachine(name, initial, transitions) — holds current state, validates transitions, fires guards/callbacks, records history.
transition(to, **context) -> bool — attempt a state change; returns False if no valid transition exists or a guard rejects.
can_transition(to) -> bool — check without side effects.
history() -> list[(from, to, timestamp)] — full audit trail.
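
The API above could be sketched roughly as follows. This is a minimal illustration, not the actual trails.fsm code: it assumes states are plain strings, that guards and on_enter hooks receive the transition context as a dict, that timestamps come from time.time(), and that can_transition() only checks whether an edge exists without running guards (the ADR leaves all of these open).

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Assumed callback shapes: both receive the **context passed to transition().
Guard = Callable[[dict], bool]
Hook = Callable[[dict], None]


@dataclass(frozen=True)
class Transition:
    """One allowed edge in the state graph."""
    from_state: str
    to_state: str
    guard: Optional[Guard] = None     # returning False vetoes the edge
    on_enter: Optional[Hook] = None   # side effect fired after the change


class StateMachine:
    def __init__(self, name: str, initial: str, transitions: list[Transition]):
        self.name = name
        self.state = initial
        # Index edges by (from, to) so lookups are a single dict access.
        self._edges = {(t.from_state, t.to_state): t for t in transitions}
        self._history: list[tuple[str, str, float]] = []

    def can_transition(self, to: str) -> bool:
        """True if an edge exists from the current state (guards are not run)."""
        return (self.state, to) in self._edges

    def transition(self, to: str, **context: Any) -> bool:
        edge = self._edges.get((self.state, to))
        if edge is None:
            return False                      # no such edge
        if edge.guard is not None and not edge.guard(context):
            return False                      # guard rejected
        self._history.append((self.state, to, time.time()))
        self.state = to
        if edge.on_enter is not None:
            edge.on_enter(context)
        return True

    def history(self) -> list[tuple[str, str, float]]:
        return list(self._history)            # copy, so callers cannot mutate it


# Usage: the guard blocks the edge until the caller passes ready=True.
fsm = StateMachine("demo", "idle", [
    Transition("idle", "running", guard=lambda ctx: ctx.get("ready", False)),
    Transition("running", "done"),
])
first = fsm.transition("running")               # False, guard rejects
second = fsm.transition("running", ready=True)  # True, state is now "running"
```

Returning False instead of raising keeps rejected transitions cheap for callers that poll, at the cost of losing the reason for the rejection. A real implementation might log or record rejected attempts as well.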

2. Peer FSM (federation_mesh)

DISCOVERED → PROBING → HEALTHY → DEGRADED → UNREACHABLE → REMOVED
                  ↑         ↓          ↓          ↑
                  └─────────┴──────────┘          │
                  DEGRADED → HEALTHY (recovered)  │
                  UNREACHABLE → PROBING (retry)   │

States:

- DISCOVERED: peer is known from config or DNS-SD but never probed.
- PROBING: health check in progress.
- HEALTHY: all endpoints responding.
- DEGRADED: partial response (e.g., SPARQL ok but MCP down).
- UNREACHABLE: all probes failed.
- REMOVED: soft-removed after repeated failures.

This replaces the string status field on PeerHealth with one FSM instance per peer, stored in MeshManager._peer_fsm.
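
As a rough illustration of what the per-peer history makes possible, the sketch below answers the two questions raised in the Context section from a list of (from, to, timestamp) tuples. The edge set is an assumption: it encodes only the linear chain and the two labelled back-edges from the diagram, so a real table would likely need more edges (for example, a failed first probe). The lowercase names for the new states, the helper names, and the choice to count only degraded to healthy as a recovery are also hypothetical.

```python
from typing import Optional

# Assumed edge set: the linear chain plus the two labelled back-edges only.
PEER_EDGES = {
    ("discovered", "probing"),
    ("probing", "healthy"),
    ("healthy", "degraded"),
    ("degraded", "unreachable"),
    ("unreachable", "removed"),
    ("degraded", "healthy"),      # recovered
    ("unreachable", "probing"),   # retry
}

History = list[tuple[str, str, float]]  # (from, to, timestamp)


def last_entered(history: History, state: str) -> Optional[float]:
    """Timestamp of the most recent transition into `state`, or None."""
    for _from, to, ts in reversed(history):
        if to == state:
            return ts
    return None


def recovery_count(history: History) -> int:
    """Number of degraded -> healthy transitions recorded."""
    return sum(1 for frm, to, _ in history if (frm, to) == ("degraded", "healthy"))


# Hand-written sample history; the timestamps are arbitrary floats.
sample: History = [
    ("discovered", "probing", 100.0),
    ("probing", "healthy", 101.0),
    ("healthy", "degraded", 250.0),
    ("degraded", "healthy", 300.0),
    ("healthy", "degraded", 420.0),
]
assert all(entry[:2] in PEER_EDGES for entry in sample)

print(last_entered(sample, "degraded"))  # 420.0
print(recovery_count(sample))            # 1
```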

3. Invocation FSM (runtime)

RECEIVED → VALIDATING → AUTHORIZED → EXECUTING → COMPLETED
                ↓              ↓            ↓
             REJECTED       DENIED       FAILED

States:

- RECEIVED: invoke() entered, args parsed.
- VALIDATING: checking required params and shape constraints.
- AUTHORIZED: policy evaluation passed (or no policy configured).
- EXECUTING: handler running.
- COMPLETED: handler returned successfully, provenance attached.
- REJECTED: validation failed (missing args, shape violation).
- DENIED: policy evaluation returned DENY.
- FAILED: handler raised an exception.

The FSM is created per invoke() call, its terminal state is recorded in the response envelope (or exception metadata), and its history is available for observability consumers.
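
The sketch below shows one way the state trail could end up in a response envelope. It is illustrative only: invoke_with_trail, its parameters, and the envelope keys are hypothetical names, validation is reduced to a required-argument check, and provenance is omitted. It also assumes DENIED is entered directly from VALIDATING when the policy says no, because AUTHORIZED is described as "policy evaluation passed"; the diagram can also be read as branching from AUTHORIZED.

```python
from typing import Any, Callable, Mapping, Optional


def invoke_with_trail(
    args: Mapping[str, Any],
    required: list[str],
    handler: Callable[..., Any],
    policy: Optional[Callable[[Mapping[str, Any]], bool]] = None,
) -> dict[str, Any]:
    """Run a handler and return an envelope carrying the state trail."""
    trail = ["RECEIVED"]

    def envelope(result: Any = None, error: Optional[str] = None) -> dict[str, Any]:
        # The last entry in the trail is the terminal state.
        return {"state": trail[-1], "trail": list(trail),
                "result": result, "error": error}

    trail.append("VALIDATING")
    missing = [name for name in required if name not in args]
    if missing:
        trail.append("REJECTED")
        return envelope(error=f"missing args: {missing}")

    # No policy configured counts as authorized, per the state description.
    if policy is not None and not policy(args):
        trail.append("DENIED")
        return envelope(error="policy returned DENY")
    trail.append("AUTHORIZED")

    trail.append("EXECUTING")
    try:
        result = handler(**args)
    except Exception as exc:
        trail.append("FAILED")
        return envelope(error=repr(exc))
    trail.append("COMPLETED")
    return envelope(result=result)


ok = invoke_with_trail({"x": 2}, ["x"], lambda x: x * 2)
print(ok["state"], ok["result"])  # COMPLETED 4

bad = invoke_with_trail({}, ["x"], lambda x: x * 2)
print(bad["trail"])               # ['RECEIVED', 'VALIDATING', 'REJECTED']
```

A failed invocation then leaves a queryable record of where the pipeline stopped, rather than only an exception.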

Consequences

Positive: peer lifecycle is explicit, auditable, and extensible (add new states like QUARANTINED without touching boolean logic).
Positive: invocation lifecycle gives observability hooks a structured state trail instead of ad-hoc event names.
Positive: generic FSM is reusable for future lifecycle needs (agent sessions, schema migrations, etc.).
Negative: slight overhead per invocation (one dict and one list allocation), negligible compared to handler execution time.
Risk: existing code that reads PeerHealth.status as a raw string must be updated. This is mitigated by keeping the string values identical to the old ones ("healthy", "degraded", "unreachable").
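
One way to realize this mitigation in Python is a str-backed Enum, whose members compare equal to their raw string values. The sketch below is an assumption about how the states might be declared, not the project's actual definition; PeerStatus and legacy_is_up are hypothetical names, and only the three pre-existing values are shown.

```python
from enum import Enum


class PeerStatus(str, Enum):
    # Values match the old raw strings exactly.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNREACHABLE = "unreachable"


def legacy_is_up(status: str) -> bool:
    """Stand-in for an existing caller that compares raw strings."""
    return status == "healthy"


# Old-style comparisons keep working when handed an enum member.
assert legacy_is_up(PeerStatus.HEALTHY)
assert not legacy_is_up(PeerStatus.DEGRADED)

# Stored raw strings round-trip back to members.
assert PeerStatus("degraded") is PeerStatus.DEGRADED
assert PeerStatus.UNREACHABLE.value == "unreachable"
```

Callers that format the status with str() may still need a review, since the string form of an enum member is not guaranteed to be its value across Python versions.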