ADR-0037: Hypothesis-Driven Agent Loops — Scientific Reasoning on Knowledge Graphs
- Status: Accepted (2026-04-19)
- Date: 2026-04-18
Context
Trails ships three planning strategies (ADR-0018): ReAct (reason+act), Plan-and-Execute (decompose→dispatch→merge), and Reflexion (inner loop + critic). All three are task-completion planners — given a goal, they try tool calls until they arrive at an answer.
Knowledge graphs, however, invite a fundamentally different interaction pattern: scientific reasoning. An analyst does not just ask for "the answer"; they observe patterns in the data, formulate a hypothesis about why the pattern exists (or what else must be true if it does), design a test that would confirm or refute the hypothesis, execute the test, update their confidence, and report with citations to the specific KG nodes that support the conclusion.
No existing planner in Trails — or, to our knowledge, any KG framework — ships a first-class hypothesis loop. The gap shows up in three concrete use cases:
- Root-cause analysis. An engineer sees anomalous triples in a supply-chain KG. The agent should propose candidate explanations, test each against the graph, and report the most-supported one — not just run a SPARQL query and return the result set.
- Data quality investigation. SHACL reports 47 violations. A human wants to know why — is it a single upstream source, a mapping bug, or a schema drift? The agent should formulate competing hypotheses and test them.
- Research assistance. A researcher exploring an academic KG asks "what factors correlate with X?" The agent should iterate: observe → hypothesize → test → refine, producing a grounded, citable report rather than a one-shot LLM hallucination.
The existing planners can be prompted toward this behaviour, but they lack the structure to track hypotheses as first-class objects, propagate confidence, or produce grounded citations. Bolting these on as prompt engineering is fragile and invisible to auditors.
Decision
1. New planner type: hypothesis
A new module trails.agent.planners.hypothesis ships alongside
react, plan_and_execute, and reflexion. It follows the same
PlanningStrategy protocol (same run() signature shape, returns
PlanResult), so it slots into the existing session infrastructure
with zero changes to the harness.
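As a rough illustration of what conforming to the protocol could look like, the sketch below defines a minimal PlanningStrategy protocol and a skeleton planner. The run() parameters (goal, ctx) and every PlanResult field other than strategy are assumptions made for this example; ADR-0018 defines the real shapes.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@dataclass
class PlanResult:
    """Hypothetical stand-in for the PlanResult defined in ADR-0018."""
    strategy: str
    answer: str                                   # assumed field
    steps: list[Any] = field(default_factory=list)  # assumed field


@runtime_checkable
class PlanningStrategy(Protocol):
    """Assumed protocol shape: a single run() entry point returning a PlanResult."""

    def run(self, goal: str, ctx: Any) -> PlanResult: ...


class HypothesisPlanner:
    """Skeleton only; the observe/hypothesize/test/update/report loop would live in run()."""

    def run(self, goal: str, ctx: Any) -> PlanResult:
        return PlanResult(strategy="hypothesis", answer=f"report for: {goal}")


planner = HypothesisPlanner()
result = planner.run("why are there 47 SHACL violations?", ctx=None)
print(isinstance(planner, PlanningStrategy), result.strategy)
```

Because the planner satisfies the protocol structurally, a harness that only depends on run() and PlanResult would not need to know which strategy it is driving.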
2. Hypothesis lifecycle
Each hypothesis passes through a five-phase cycle: Observe → Hypothesize → Test → Update → Report.
Phase details:
- Observe: Query the KG (via ctx.kg.query()) with a broad exploratory SPARQL query generated or selected by the LLM. The result is summarized and fed back as context.
- Hypothesize: The LLM proposes a testable statement based on the observations. Stored as a Hypothesis dataclass with status=proposed, confidence=0.5 (uninformative prior).
- Test: The LLM designs a SPARQL query or capability call that would produce evidence for or against the hypothesis. Executed via ctx.kg.query() or invoke().
- Update: Based on test results, the LLM adjusts confidence (always clamped to [0, 1]). If confidence >= min_confidence → status=supported. If confidence drops below a refutation threshold or the LLM explicitly refutes → status=refuted. Otherwise status=testing and the loop continues.
- Report: Once a hypothesis is supported (or all are exhausted), generate a natural-language report with inline citations to KG nodes.
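The status transitions in the Update phase can be sketched as a small state update. This is a minimal sketch, not the shipped module: the threshold values (0.8 and 0.2), the refute_below parameter name, and the choice to let an explicit refutation take precedence over the confidence check are all assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    PROPOSED = "proposed"
    TESTING = "testing"
    SUPPORTED = "supported"
    REFUTED = "refuted"


@dataclass
class Hypothesis:
    statement: str
    status: Status = Status.PROPOSED
    confidence: float = 0.5  # uninformative prior, as described above


def update(
    h: Hypothesis,
    new_confidence: float,
    *,
    min_confidence: float = 0.8,   # assumed default
    refute_below: float = 0.2,     # assumed refutation threshold
    explicitly_refuted: bool = False,
) -> Hypothesis:
    """Apply one Update phase: clamp confidence, then pick the next status."""
    h.confidence = min(1.0, max(0.0, new_confidence))
    if explicitly_refuted or h.confidence < refute_below:
        h.status = Status.REFUTED       # assumption: explicit refutation wins
    elif h.confidence >= min_confidence:
        h.status = Status.SUPPORTED
    else:
        h.status = Status.TESTING       # loop continues with another test
    return h


h = Hypothesis("Violations originate from a single upstream source")
update(h, 0.6)
print(h.status.value)  # testing
update(h, 0.9)
print(h.status.value)  # supported
```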
3. Hypotheses as first-class KG nodes
persist_hypothesis(ctx, hypothesis) writes the hypothesis into the
provenance graph as a trails:Hypothesis node with PROV-O lineage:
- prov:wasGeneratedBy → the trails:HypothesisPlan activity
- prov:wasAssociatedWith → the session principal
- trails:confidence → the final confidence float
- trails:supportingEvidence / trails:contradictingEvidence → IRI lists pointing to the KG nodes cited as evidence
This means the graph is self-documenting: downstream queries can ask "what hypotheses has this agent tested?" and "what evidence supports this conclusion?" without leaving SPARQL.
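To make the node shape concrete, the sketch below builds the triples such a hypothesis node might carry and shows one SPARQL query a downstream consumer could run. It is not the persist_hypothesis implementation: the trails: namespace IRI is a placeholder, the helper name hypothesis_triples is hypothetical, and plain tuples stand in for whatever RDF library the provenance graph actually uses.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TRAILS = "https://example.org/trails#"  # placeholder; the real namespace is not given here


def hypothesis_triples(hyp_iri, activity_iri, principal_iri, confidence,
                       supporting, contradicting):
    """Return (subject, predicate, object) tuples for one hypothesis node."""
    triples = [
        (hyp_iri, RDF_TYPE, TRAILS + "Hypothesis"),
        (hyp_iri, PROV + "wasGeneratedBy", activity_iri),
        (hyp_iri, PROV + "wasAssociatedWith", principal_iri),
        (hyp_iri, TRAILS + "confidence", float(confidence)),
    ]
    # One triple per cited evidence node, so each citation is queryable on its own.
    triples += [(hyp_iri, TRAILS + "supportingEvidence", iri) for iri in supporting]
    triples += [(hyp_iri, TRAILS + "contradictingEvidence", iri) for iri in contradicting]
    return triples


# Example downstream query: hypotheses with their confidence and supporting evidence.
TESTED_HYPOTHESES_QUERY = """
PREFIX prov:   <http://www.w3.org/ns/prov#>
PREFIX trails: <https://example.org/trails#>
SELECT ?hypothesis ?confidence ?evidence WHERE {
  ?hypothesis a trails:Hypothesis ;
              trails:confidence ?confidence .
  OPTIONAL { ?hypothesis trails:supportingEvidence ?evidence }
}
ORDER BY DESC(?confidence)
"""

triples = hypothesis_triples(
    "urn:example:hyp/1", "urn:example:plan/1", "urn:example:user/alice",
    0.9, ["urn:example:node/a", "urn:example:node/b"], ["urn:example:node/c"],
)
print(len(triples))  # 7
```

Note that PROV-O itself defines prov:wasAssociatedWith between an activity and an agent; the sketch simply mirrors the property names listed above.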
4. Confidence propagation
Conclusions cite their supporting evidence nodes. Confidence is a scalar float in [0, 1] updated by the LLM at each test step. The delta per step is clamped to [-1, 1] and the resulting confidence is always clamped to [0, 1]. No Bayesian machinery in Phase 1 — the LLM is the estimator. Future phases may introduce formal confidence networks.
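The two clamping rules compose as follows. This is a minimal sketch of the arithmetic only; the function names are hypothetical, and the deltas would come from the LLM in practice.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def apply_delta(confidence: float, delta: float) -> float:
    """Clamp the per-step delta to [-1, 1], then the result to [0, 1]."""
    delta = clamp(delta, -1.0, 1.0)
    return clamp(confidence + delta, 0.0, 1.0)


# Worked trace from the 0.5 prior: a modest boost, an overshoot, an oversized drop.
confidence = 0.5
trace = []
for delta in (0.25, 0.5, -3.0):
    confidence = apply_delta(confidence, delta)
    trace.append(confidence)

print(trace)  # [0.75, 1.0, 0.0]
```

The second step would reach 1.25 without the outer clamp, and the third delta is first reduced to -1.0 before being applied, so a single step can move confidence across the whole range but never outside it.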
5. Grounded citations
Every claim in the final report links to the KG nodes that support
it via a Citation dataclass (node_iri, relevance, field,
value). The report generation step requires the LLM to reference
specific IRIs from the test results; ungrounded claims are a
failure mode the prompt explicitly guards against.
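Beyond the prompt-level guard, a grounding check could also be expressed in code. The sketch below assumes field types for Citation (only the field names come from this ADR) and uses a hypothetical helper, ungrounded(), that flags citations whose IRI never appeared in the test results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    node_iri: str
    relevance: float  # assumed type
    field: str        # assumed: the property on the node that was cited
    value: str        # assumed: the value observed for that property


def ungrounded(citations, evidence_iris):
    """Return the citations whose node_iri is absent from the test results."""
    known = set(evidence_iris)
    return [c for c in citations if c.node_iri not in known]


# IRIs returned by the Test phase (illustrative values).
evidence = {"urn:example:shipment/17", "urn:example:supplier/4"}

report_citations = [
    Citation("urn:example:shipment/17", 0.9, "delayDays", "12"),
    Citation("urn:example:supplier/99", 0.4, "region", "EU"),  # never observed
]

bad = ungrounded(report_citations, evidence)
print([c.node_iri for c in bad])  # ['urn:example:supplier/99']
```

A non-empty result could be used to reject or regenerate the report, though whether the planner does so is not specified here.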
6. Compatibility
- Uses the existing Session / LLMClient / Context infrastructure; no new runtime dependencies.
- Returns PlanResult with strategy="hypothesis" so session replay and A/B comparison across strategies work unchanged.
- Registers HypothesisStep phases in PlanStep.thought / PlanStep.action so the existing trajectory viewer renders hypothesis loops without modification.
- Budget enforcement (max_cost_usd, max_tokens, max_wall_time_s) is inherited from the shared check_budget helper.
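One way the phase-to-step mapping could work is sketched below. Both dataclasses are simplified stand-ins: the real PlanStep comes from ADR-0018, and the HypothesisStep fields (phase, summary, query) and the to_plan_step() helper are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    """Minimal stand-in exposing only the two fields named above."""
    thought: str
    action: str


@dataclass
class HypothesisStep:
    phase: str        # observe | hypothesize | test | update | report
    summary: str      # what the LLM concluded or intends to do
    query: str = ""   # SPARQL or capability call, if the phase issued one


def to_plan_step(step: HypothesisStep) -> PlanStep:
    """Fold a hypothesis phase into the generic thought/action pair."""
    return PlanStep(
        thought=f"[{step.phase}] {step.summary}",
        action=step.query or "none",
    )


plan_step = to_plan_step(
    HypothesisStep("test", "Check whether violations share one source",
                   "SELECT ?src WHERE { ?v ex:source ?src }")
)
print(plan_step.thought)  # [test] Check whether violations share one source
```

Prefixing the phase into thought keeps the viewer unaware of hypothesis-specific types while still letting a reader see which phase each step belongs to.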
Non-goals
- Bayesian confidence networks. Phase 1 uses LLM-estimated confidence as a scalar. Formal probabilistic reasoning is a future extension.
- Multi-hypothesis DAGs. Phase 1 tests hypotheses sequentially. Parallel hypothesis testing with dependency graphs is out of scope.
- Automated hypothesis generation from schema. The LLM generates hypotheses from observations; the framework does not auto-generate candidates from SHACL shapes or OWL axioms.
- Confidence thresholds as policy. Cedar policies do not gate on hypothesis confidence in Phase 1.
Dependencies
| ADR | Relationship |
|---|---|
| ADR-0018 (Agent Planners) | Planner protocol; PlanResult/PlanStep; budget enforcement |
| ADR-0009 (PROV-O) | Provenance integration; hypothesis nodes in prov graph |
| ADR-0021 (Progressive Enhancement) | Surface compatibility; no new tiers |
Consequences
Positive
- Unique capability. To our knowledge, no KG framework ships a hypothesis-driven agent loop. The combination of structured reasoning + KG grounding + PROV-O lineage is a genuine differentiator.
- Auditability. Every hypothesis, test, and confidence update is recorded in the KG. Auditors can trace conclusions back to evidence without replaying the agent session.
- Grounded output. Citations force the LLM to reference actual KG data, reducing hallucination risk compared to free-form summarization.
- Composable. The hypothesis planner composes with existing infrastructure (Session, LLMClient, capabilities) and can be wrapped by Reflexion for critic-reviewed hypothesis reports.
Negative
- LLM cost. The observe→hypothesize→test→update cycle is inherently multi-turn. Mitigated by: max_hypotheses and max_steps caps, budget enforcement, and min_confidence early termination.
- Confidence calibration. LLM-estimated confidence is not statistically calibrated. Mitigated by: treating confidence as a heuristic ranking signal, not a probability; documenting the limitation; planning formal calibration for Phase 2.
- Prompt complexity. The hypothesis system prompt is larger than ReAct's. Mitigated by: the same minification techniques used in ReAct's tool catalog.
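The cost mitigations bound the number of LLM-judged test steps. The sketch below shows how the caps and early termination might interact in a sequential loop; it assumes max_steps applies per hypothesis (the ADR does not say whether it is per hypothesis or global), uses illustrative default values, and replaces the LLM with a deterministic stub.

```python
def run_loop(candidates, test_step, *, max_hypotheses=3, max_steps=4,
             min_confidence=0.8):
    """Test hypotheses one at a time; stop as soon as one is supported.

    test_step(statement, confidence) -> new confidence (stands in for the LLM).
    Returns (supported_statement_or_None, confidence, number_of_test_calls).
    """
    calls = 0
    for statement in candidates[:max_hypotheses]:      # cap on hypotheses
        confidence = 0.5                                # uninformative prior
        for _ in range(max_steps):                      # cap on steps (assumed per hypothesis)
            confidence = min(1.0, max(0.0, test_step(statement, confidence)))
            calls += 1
            if confidence >= min_confidence:            # early termination
                return statement, confidence, calls
    return None, 0.0, calls


def fake_test_step(statement, confidence):
    # Stub: evidence favours only the "mapping bug" explanation.
    return confidence + 0.125 if statement == "mapping bug" else confidence - 0.25


outcome = run_loop(["upstream source", "mapping bug", "schema drift"], fake_test_step)
print(outcome)  # ('mapping bug', 0.875, 7)
```

Under these assumptions the worst case is max_hypotheses × max_steps test calls (12 here); early termination stopped the example after 7.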
Revisit conditions
- If Bayesian confidence networks prove tractable, upgrade the scalar confidence to a formal posterior.
- If multi-hypothesis parallelism is needed, extend the loop to support DAG-structured hypothesis trees.
- If LLM confidence proves unreliable in practice, introduce calibration via held-out test sets or human feedback.