ADR-0037: Hypothesis-Driven Agent Loops — Scientific Reasoning on Knowledge Graphs

  • Status: Accepted (2026-04-19)
  • Date: 2026-04-18

Context

Trails ships three planning strategies (ADR-0018): ReAct (reason+act), Plan-and-Execute (decompose→dispatch→merge), and Reflexion (inner loop + critic). All three are task-completion planners — given a goal, they try tool calls until they arrive at an answer.

Knowledge graphs, however, invite a fundamentally different interaction pattern: scientific reasoning. An analyst does not just "find me the answer"; they observe patterns in the data, formulate a hypothesis about why the pattern exists (or what else must be true if it does), design a test that would confirm or refute the hypothesis, execute the test, update their confidence, and report with citations to the specific KG nodes that support the conclusion.

No existing planner in Trails — or, to our knowledge, any KG framework — ships a first-class hypothesis loop. The gap shows up in three concrete use cases:

  1. Root-cause analysis. An engineer sees anomalous triples in a supply-chain KG. The agent should propose candidate explanations, test each against the graph, and report the most-supported one — not just run a SPARQL query and return the result set.
  2. Data quality investigation. SHACL reports 47 violations. A human wants to know why — is it a single upstream source, a mapping bug, or a schema drift? The agent should formulate competing hypotheses and test them.
  3. Research assistance. A researcher exploring an academic KG asks "what factors correlate with X?" The agent should iterate: observe → hypothesize → test → refine, producing a grounded, citable report rather than a one-shot LLM hallucination.

The existing planners can be prompted toward this behaviour, but they lack the structure to track hypotheses as first-class objects, propagate confidence, or produce grounded citations. Bolting these on as prompt engineering is fragile and invisible to auditors.

Decision

1. New planner type: hypothesis

A new module trails.agent.planners.hypothesis ships alongside react, plan_and_execute, and reflexion. It follows the same PlanningStrategy protocol (same run() signature shape, returns PlanResult), so it slots into the existing session infrastructure with zero changes to the harness.

2. Hypothesis lifecycle

Each hypothesis passes through a five-phase cycle:

Observe KG → Formulate hypothesis → Design test → Execute test → Update confidence

Phase details (Test covers both designing and executing the test; Report runs once, after the loop ends):

  • Observe: Query the KG (via ctx.kg.query()) with a broad exploratory SPARQL query generated or selected by the LLM. The result is summarized and fed back as context.
  • Hypothesize: The LLM proposes a testable statement based on the observations. Stored as a Hypothesis dataclass with status=proposed, confidence=0.5 (uninformative prior).
  • Test: The LLM designs a SPARQL query or capability call that would produce evidence for or against the hypothesis. Executed via ctx.kg.query() or invoke().
  • Update: Based on test results, the LLM adjusts confidence (always clamped to [0, 1]). If confidence >= min_confidence → status=supported. If confidence drops below a refutation threshold or the LLM explicitly refutes → status=refuted. Otherwise status=testing and the loop continues.
  • Report: Once a hypothesis is supported (or all are exhausted), generate a natural-language report with inline citations to KG nodes.
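The status transitions above can be sketched as a small pure function. This is a minimal illustration, not the shipped module: the field names on Hypothesis beyond status and confidence, the refute_below parameter name, and both default thresholds (0.8 and 0.2) are assumptions made for the example. It also assumes that an explicit refutation by the LLM takes precedence over a high confidence value, which the ADR does not specify.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    PROPOSED = "proposed"
    TESTING = "testing"
    SUPPORTED = "supported"
    REFUTED = "refuted"


@dataclass
class Hypothesis:
    statement: str                    # hypothetical field name
    status: Status = Status.PROPOSED
    confidence: float = 0.5           # uninformative prior, as stated in the ADR


def next_status(confidence, min_confidence=0.8, refute_below=0.2,
                explicitly_refuted=False):
    """Map a confidence value to the next lifecycle status.

    Threshold defaults are illustrative assumptions, not Trails defaults.
    """
    # Assumption: an explicit refutation wins over any confidence value.
    if explicitly_refuted or confidence < refute_below:
        return Status.REFUTED
    if confidence >= min_confidence:
        return Status.SUPPORTED
    # Neither threshold reached: keep testing.
    return Status.TESTING


h = Hypothesis("Violations originate from a single upstream source")
h.confidence = 0.85
h.status = next_status(h.confidence)
```

Keeping the transition in one function makes the thresholds easy to audit and to test separately from the LLM calls.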

3. Hypotheses as first-class KG nodes

persist_hypothesis(ctx, hypothesis) writes the hypothesis into the provenance graph as a trails:Hypothesis node with PROV-O lineage:

  • prov:wasGeneratedBy → the trails:HypothesisPlan activity
  • prov:wasAssociatedWith → the session principal
  • trails:confidence → the final confidence float
  • trails:supportingEvidence / trails:contradictingEvidence → IRI lists pointing to the KG nodes cited as evidence

This means the graph is self-documenting: downstream queries can ask "what hypotheses has this agent tested?" and "what evidence supports this conclusion?" without leaving SPARQL.
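As a rough sketch of what persist_hypothesis might emit, the function below builds the triples for one hypothesis as plain tuples. The property names mirror the list above. Everything else is an assumption for illustration: the TRAILS namespace IRI is a placeholder, the function name hypothesis_triples is hypothetical, and the "IRI lists" are modelled as one triple per evidence node rather than an RDF collection.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TRAILS = "https://example.org/trails#"  # placeholder namespace (assumption)


def hypothesis_triples(hyp_iri, plan_iri, principal_iri, confidence,
                       supporting=(), contradicting=()):
    """Return (subject, predicate, object) tuples describing one hypothesis."""
    triples = [
        (hyp_iri, RDF_TYPE, TRAILS + "Hypothesis"),
        (hyp_iri, PROV + "wasGeneratedBy", plan_iri),
        (hyp_iri, PROV + "wasAssociatedWith", principal_iri),
        (hyp_iri, TRAILS + "confidence", float(confidence)),
    ]
    # Assumption: one triple per cited evidence node.
    triples += [(hyp_iri, TRAILS + "supportingEvidence", iri)
                for iri in supporting]
    triples += [(hyp_iri, TRAILS + "contradictingEvidence", iri)
                for iri in contradicting]
    return triples


triples = hypothesis_triples(
    "urn:example:hyp/1",
    "urn:example:plan/1",
    "urn:example:principal/alice",
    0.82,
    supporting=["urn:example:node/a", "urn:example:node/b"],
    contradicting=["urn:example:node/c"],
)
```

With this shape, "what evidence supports this conclusion?" reduces to selecting the objects of the supportingEvidence predicate for a given hypothesis IRI.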

4. Confidence propagation

Conclusions cite their supporting evidence nodes. Confidence is a scalar float in [0, 1] updated by the LLM at each test step. The delta per step is clamped to [-1, 1] and the resulting confidence is always clamped to [0, 1]. No Bayesian machinery in Phase 1 — the LLM is the estimator. Future phases may introduce formal confidence networks.
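The two clamping rules can be written down in a few lines. This is a sketch of the arithmetic only; the function names clamp and update_confidence are hypothetical, and the LLM-proposed delta is passed in as a plain float.

```python
def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def update_confidence(current, delta):
    """Apply an LLM-proposed delta to a confidence value.

    The delta is clamped to [-1, 1] first, then the result to [0, 1].
    """
    delta = clamp(delta, -1.0, 1.0)
    return clamp(current + delta, 0.0, 1.0)


print(update_confidence(0.5, 0.25))   # 0.75
print(update_confidence(0.5, 5.0))    # 1.0 (delta clamped to 1, result to 1)
print(update_confidence(0.25, -3.0))  # 0.0 (delta clamped to -1, result to 0)
```

Because the prior is 0.5 and a single delta can be as large as 1 in magnitude, one test step is enough to move a hypothesis to either extreme. Step count is therefore bounded by the planner's caps, not by the clamp.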

5. Grounded citations

Every claim in the final report links to the KG nodes that support it via a Citation dataclass (node_iri, relevance, field, value). The report generation step requires the LLM to reference specific IRIs from the test results; ungrounded claims are a failure mode the prompt explicitly guards against.
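A post-hoc grounding check is one way to complement the prompt-level guard. The sketch below uses the four Citation fields named above; the field types, the helper name ungrounded, and the idea of validating citations after generation are assumptions for illustration, not behaviour the ADR specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    node_iri: str
    relevance: float   # assumed numeric score
    field: str         # assumed: property or column the value came from
    value: str         # assumed: stringified value cited in the report


def ungrounded(citations, evidence_iris):
    """Return citations whose node_iri never appeared in the test results."""
    allowed = set(evidence_iris)
    return [c for c in citations if c.node_iri not in allowed]


seen_in_tests = ["urn:example:node/a", "urn:example:node/b"]
report_citations = [
    Citation("urn:example:node/a", 0.9, "ex:supplier", "Acme"),
    Citation("urn:example:node/z", 0.4, "ex:supplier", "Unknown"),
]
bad = ungrounded(report_citations, seen_in_tests)
```

A non-empty result could be used to reject or regenerate the report, so that ungrounded claims are detected mechanically rather than by the prompt alone.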

6. Compatibility

  • Uses the existing Session / LLMClient / Context infrastructure — no new runtime dependencies.
  • Returns PlanResult with strategy="hypothesis" so session replay and A/B comparison across strategies work unchanged.
  • Registers HypothesisStep phases in PlanStep.thought / PlanStep.action so the existing trajectory viewer renders hypothesis loops without modification.
  • Budget enforcement (max_cost_usd, max_tokens, max_wall_time_s) is inherited from the shared check_budget helper.

Non-goals

  • Bayesian confidence networks. Phase 1 uses LLM-estimated confidence as a scalar. Formal probabilistic reasoning is a future extension.
  • Multi-hypothesis DAGs. Phase 1 tests hypotheses sequentially. Parallel hypothesis testing with dependency graphs is out of scope.
  • Automated hypothesis generation from schema. The LLM generates hypotheses from observations; the framework does not auto-generate candidates from SHACL shapes or OWL axioms.
  • Confidence thresholds as policy. Cedar policies do not gate on hypothesis confidence in Phase 1.

Dependencies

ADR | Relationship
ADR-0018 (Agent Planners) | Planner protocol; PlanResult/PlanStep; budget enforcement
ADR-0009 (PROV-O) | Provenance integration; hypothesis nodes in prov graph
ADR-0021 (Progressive Enhancement) | Surface compatibility; no new tiers

Consequences

Positive

  • Unique capability. To our knowledge, no KG framework ships a hypothesis-driven agent loop. The combination of structured reasoning + KG grounding + PROV-O lineage is a genuine differentiator.
  • Auditability. Every hypothesis, test, and confidence update is recorded in the KG. Auditors can trace conclusions back to evidence without replaying the agent session.
  • Grounded output. Citations force the LLM to reference actual KG data, reducing hallucination risk compared to free-form summarization.
  • Composable. The hypothesis planner composes with existing infrastructure (Session, LLMClient, capabilities) and can be wrapped by Reflexion for critic-reviewed hypothesis reports.

Negative

  • LLM cost. The observe→hypothesize→test→update cycle is inherently multi-turn. Mitigated by: max_hypotheses and max_steps caps, budget enforcement, and min_confidence early termination.
  • Confidence calibration. LLM-estimated confidence is not statistically calibrated. Mitigated by: treating confidence as a heuristic ranking signal, not a probability; documenting the limitation; planning formal calibration for Phase 2.
  • Prompt complexity. The hypothesis system prompt is larger than ReAct's. Mitigated by: the same minification techniques used in ReAct's tool catalog.

Revisit conditions

  • If Bayesian confidence networks prove tractable, upgrade the scalar confidence to a formal posterior.
  • If multi-hypothesis parallelism is needed, extend the loop to support DAG-structured hypothesis trees.
  • If LLM confidence proves unreliable in practice, introduce calibration via held-out test sets or human feedback.