Skip to content

Agent Runtime

trails.agent is the framework-owned planning loop. @capability + invoke() is pure dispatch; a planner turns a natural-language goal into a trajectory of dispatches. M9 Phase 3 ships three strategies — ReAct, Plan-and-Execute, and Reflexion (ADR-0018) — behind the PlanningStrategy protocol so swapping one for another is a one-line import change. The planner is a thin orchestrator: it schedules invoke(), does not replace it, and every step lands in the prov: graph and cost envelope the kernel owns. See the LLM Client & Session guide for the primitives this loop is built on.

Quickstart

from trails import capability
from trails.agent import Session
from trails.agent.planners import react
from trails.llm import LLMClient

@capability(id="notes.search", description="Search notes by tag.")
def search(ctx, tag: str) -> dict:
    return {"hits": ["n1", "n2"]}

# Pick any provider — same API:
client = LLMClient.ollama(model="qwen3:8b")          # local, free
# client = LLMClient.anthropic(model="claude-sonnet-4-5")  # cloud
# client = LLMClient.mock(response="finish")               # tests

sess = Session(principal="did:local:alice")
result = react.run("Find urgent notes.", llm=client, session=sess)
print(result.answer, "in", len(result.steps), "steps")

No tool list is passed — the planner walks the live @capability registry. Pass tools=[...] to scope the set, and ctx=ctx to get automatic cost attribution and PROV-O linkage.

Planning strategies

All three strategies share the same core signature — run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, tool_filter=None, max_cost_usd=None, max_tokens=None, max_wall_time_s=None, ...) — so swapping one for another is a one-line import change. Each has a handful of strategy-specific kwargs on top (max_replans, max_outer_iterations) documented below. Budget kwargs (max_cost_usd / max_tokens / max_wall_time_s) are detailed in Budget limits; the tools= / tool_filter= scoping story is in Filtering the tool catalogue.

Pick by trajectory shape. Use ReAct when the next step depends on the last observation and the task fits under max_steps (search-then-summarize, classify-then-route). Use Plan-and-Execute when a clear decomposition exists up front and you want one reasoning turn to cover many actions — it is cheaper per step on long-horizon workflows and fails more gracefully on individually-checkable sub-tasks. Use Reflexion when goal correctness is hard to verify in a single pass (rubric-matching, soft constraints, answer quality) and a silently-wrong answer is expensive enough to pay for a critic turn.

ReAct — trails.agent.planners.react

Think / act / observe. On every turn the LLM emits a JSON block containing thought, action, and action_input; the planner dispatches the action through trails.invoke, feeds the observation back as the next user message, and loops. action == "finish" terminates with the final_answer field as PlanResult.answer. Tool errors (TrailsError from dispatch) are not fatal — they become the step's observation so the LLM can self-correct.

When to use. Simple goals where the next step depends on the last observation: search-then-summarize, classify-then-route, read-then-decide. ReAct is the default when the task fits under max_steps (default 10) and the LLM can reason about intermediate results without a global plan. Not for long-horizon trajectories (> 10 steps) or branch-heavy workflows — prefer Plan-and-Execute once it lands.

Plan-and-Execute — trails.agent.planners.plan_and_execute

Two explicit phases. Plan: one LLM call emits a fenced ```json block holding an ordered plan list whose entries carry action, action_input, and rationale, and which must terminate with an action == "finish" step. Execute: the kernel runs each step through trails.invoke. Replanning is triggered when a step returns a TrailsError (unknown id, validation failure, policy denial) — the loop appends the observation to the session, asks the LLM for a revised plan given what already happened, and resumes. max_replans (default 3) caps how often that may happen; max_steps remains the hard ceiling on executed actions across all replans. A single malformed plan earns one nudge-retry; a second failure terminates with stopped="error". Each parsed plan is appended to session.metadata["plans"] for post-hoc inspection.

Signature.

plan_and_execute.run(goal, *, llm, session, ctx=None, max_steps=10,
    tools=None, tool_filter=None, max_replans=3,
    max_tokens_per_call=1024, temperature=0.0,
    max_cost_usd=None, max_tokens=None, max_wall_time_s=None)

When to use. Long-horizon multi-step workflows with a clear decomposition the LLM can emit in one shot ("ingest 10 files, validate each, export deltas"). Tasks where each step's success is individually checkable, and where one upfront reasoning turn amortises better than a think/act turn per step.

Worked example.

import json
from trails import capability
from trails.agent import Session
from trails.agent.planners import plan_and_execute
from trails.llm import LLMClient

@capability(id="notes.search", description="Search notes by tag.")
def search(ctx, tag: str) -> dict:
    return {"hits": ["n1", "n2"]}

plan = {"plan": [
    {"action": "notes.search", "action_input": {"tag": "urgent"}, "rationale": "list"},
    {"action": "finish", "action_input": {}, "final_answer": "2 urgent notes."},
]}
fenced = "```json\n" + json.dumps(plan) + "\n```"
client = LLMClient.mock(response=lambda _m: fenced)
result = plan_and_execute.run("Find urgent notes.", llm=client, session=Session(principal="did:local:alice"))

See internal planning and ADR-0018 Phase 3 for the design rationale.

Reflexion — trails.agent.planners.reflexion

A ReAct inner loop under a Critic outer loop. Each outer iteration runs react.run(...) once with the supplied max_steps; the resulting PlanResult is handed to a critic LLM call built on a fresh message list (so the critic's own prompt does not pollute the agent chat). The critic replies with a fenced ```json block {"verdict": "accept" | "retry", "critique": "..."}. On "accept" the loop terminates with stopped="goal_achieved"; on "retry" the critique is appended to the session as a user message tagged [reflexion critique #<n>] so the next inner ReAct pass reads it as chat history and can avoid the named defect. Outer iteration is capped by max_outer_iterations (default 3); exhaustion returns the latest inner result with stopped="max_steps", matching ReAct's step-budget semantics so A/B comparisons stay uniform. The ordered list of critic replies lives on session.metadata["critiques"]. A malformed critic reply earns one nudge-retry; a second failure terminates with stopped="error". Inner errors propagate unchanged — no amount of reflection fixes a dispatch crash.

Signature.

reflexion.run(goal, *, llm, session, ctx=None, max_steps=10,
    tools=None, tool_filter=None, max_outer_iterations=3,
    max_tokens_per_call=1024, temperature=0.0,
    max_cost_usd=None, max_tokens=None, max_wall_time_s=None)

When to use. Goal correctness is hard to judge in one pass — answer quality, rubric-matching, soft constraints, tasks where self-critique empirically improves output. The overhead of a critic turn pays off when a silently-wrong answer is expensive.

Worked example.

from trails.agent import Session
from trails.agent.planners import reflexion
from trails.llm import LLMClient, Message

inner = '```json\n{"thought":"done","action":"finish","action_input":{},"final_answer":"ok"}\n```'
critic = '```json\n{"verdict":"accept","critique":""}\n```'
def route(msgs: list[Message]) -> str:
    sys = next((m.content for m in msgs if m.role == "system"), "")
    return critic if "Reflexion critic" in sys else inner
client = LLMClient.mock(response=route)
result = reflexion.run("Answer the goal.", llm=client, session=Session(principal="did:local:alice"))

See internal planning and ADR-0018 Phase 3 for the design rationale.

Tool discovery

Without a tools= kwarg, react.run walks the live trails.decorators._handlers registry and exposes every registered @capability. Each tool entry in the system prompt carries its description, parameter names, type annotations (stringified), and required/optional flag — enough for the LLM to fill in action_input without a manifest round-trip. Pass an explicit list to scope:

react.run(goal, llm=c, session=s, tools=["notes.search", "notes.tag"])

or a callable (re-evaluated once at run start):

react.run(goal, llm=c, session=s, tools=lambda: pick_tools(context))

Unknown ids are surfaced to the prompt as (unregistered) rather than silently dropped — the LLM sees the gap and can reason about it.

Bloat concern. The prompt grows linearly with tool count. Around 50+ capabilities it dominates the input token bill and the LLM's attention budget; scope tools= to what the goal needs (static list per route, or a retrieval step before the loop).

Filtering the tool catalogue

tools= accepts four shapes (ToolsSpec in trails.agent.planners.react): None = discover every registered capability; list[str] = use exactly those ids; Callable[[], list[str]] = zero-arg callable resolved once at run() entry; int N = keep the N most-recently-registered capabilities (tail of the insertion-ordered _handlers dict). bool is explicitly rejected to prevent the True/False → keep-one-or-zero footgun.

# Explicit static list — deterministic tests, lean prompt.
react.run(goal, llm=c, session=s, tools=["notes.search", "notes.tag"])

# Top-N most recently registered — handy when plugins append tools.
react.run(goal, llm=c, session=s, tools=5)

tool_filter= is a second-stage post-filter applied to the list produced by tools=. It accepts:

  • None — no filter.
  • Callable[[dict], bool] — predicate on the per-tool metadata dict (id, description, params, required, annotations, input_shape, output_shape); return True to keep.
  • str — whitespace-separated keywords; a tool is kept iff its id or description contains at least one keyword (case-insensitive substring). An empty string passes everything.
# Keyword pre-filter driven by goal text.
react.run(goal, llm=c, session=s, tool_filter="notes search tag")

# Arbitrary predicate — e.g. keep read-only tools only.
react.run(
    goal, llm=c, session=s,
    tool_filter=lambda meta: "read-only" in meta["annotations"].get("tags", ""),
)

Why filter. Two wins:

  1. Lean prompt. Every tool entry costs input tokens every turn. At ~30 discovered capabilities discover_tools emits a UserWarning nudging you to filter; at 50+ the catalogue can dominate both cost and the LLM's attention. Scoping to the ~handful a goal actually needs is the single biggest lever on planner cost.
  2. Deterministic tests. A fixed tools=[...] list shields a test from plugin-order drift and from unrelated capabilities registered by conftest fixtures. The planner sees exactly what the test chose, nothing more.

Both kwargs are available on all three planners (react.run, plan_and_execute.run, reflexion.run) with identical semantics.

Budget limits

Three cumulative-budget kwargs cap a planner run without depending on max_steps. All three are None (= unlimited) by default and are available on react.run, plan_and_execute.run, and reflexion.run:

Kwarg Breach → PlanResult.stopped Sampled from
max_cost_usd: float \| None "max_cost" cost_tracker.total_usd()
max_tokens: int \| None "max_tokens" cost_tracker.total_tokens() (prompt + completion)
max_wall_time_s: float \| None "max_wall_time" time.monotonic() since run() entry

All three read from the shared CostTracker resolved via trails.llm._module_tracker() — the same tracker every LLMClient.complete books into, so caps are authoritative across the planner itself and any capability-internal LLM calls booked during the run. Child rows that dedupe="child" against a scope's call_id are excluded from total_usd() / total_tokens(), so a nested LLM call isn't double-counted against the budget.

When the check runs. At iteration boundaries, not mid-step: react.run checks before each turn's LLM call; plan_and_execute.run checks at the top of every planned step and before paying for a replan; reflexion.run checks at the top of each outer iteration AND threads the remaining wall-time into the inner ReAct so a long inner run aborts mid-iteration. Semantics: the in-flight operation finishes, the loop aborts, a best-effort trails:wasTerminatedBy triple is written to the prov: graph (logged-and-swallowed on failure), and a PlanResult is returned.

Result shape on abort.

  • PlanResult.stopped carries the breach reason ("max_cost" / "max_tokens" / "max_wall_time") rather than "goal_achieved".
  • PlanResult.steps is the trajectory recorded up to the abort — all prior observations are intact.
  • PlanResult.answer is not a final answer. ReAct and Plan-and-Execute stringify the last observation (JSON-encoded when it isn't already a str, via json.dumps(..., default=str)); Reflexion returns the last inner ReAct run's answer unchanged. PlanResult.error stays None — a budget breach is a clean stop, not an error.

There is no separate warning field on PlanResult; callers inspect stopped to detect an abort. discover_tools emits a UserWarning (Python warnings) when auto-discovery exceeds the soft cap of 30 capabilities and no tool_filter is set — that is the only warning-channel surface tied to this loop.

Extracting best-so-far

Because answer on a budget-abort is the last observation rather than a typed final answer, apps that need structured output on abort must extract it themselves. The canonical recipe:

from trails.agent.planners import react

result = react.run(
    goal,
    llm=c,
    session=s,
    max_cost_usd=0.25,
    max_wall_time_s=30.0,
)

if result.stopped == "goal_achieved":
    final = result.answer                         # trusted final answer
elif result.stopped in {"max_cost", "max_tokens", "max_wall_time"}:
    # Best-so-far: walk the trajectory tail-first for the most recent
    # non-error, non-"finish" observation and parse it.
    best = None
    for step in reversed(result.steps):
        obs = step.observation
        if obs is None or (isinstance(obs, str) and obs.startswith("error:")):
            continue
        best = obs
        break
    # ``result.answer`` is the same observation stringified — cheapest
    # fallback when a dict isn't required.
    final = best if best is not None else result.answer
else:  # "error"
    raise RuntimeError(result.error)

Walking result.steps wins over parsing result.answer whenever the caller needs the raw observation dict (for schema validation, partial extraction, etc.) instead of its JSON stringification. For Plan-and-Execute, the same pattern applies; additionally, session.metadata["plans"] holds every parsed plan in order, so a caller can recover the un-executed tail of the final plan if it needs to resume later. For Reflexion, session.metadata["critiques"] holds the ordered critic replies — useful context for a follow-up run that starts fresh under a raised budget.

Session and Context

Session (from trails.agent.context, covered in the LLM guide) is the conversation-state ledger the planner writes into: system prompt, assistant turns, observations, plus a parallel invocations list of raw envelopes. Reuse one Session across react.run calls to continue a conversation; create a fresh one per independent goal.

ctx is the per-invoke trails.context.Context — the same object a @capability handler receives. Pass ctx=ctx and every LLM call bills through the cost tracker tagged llm:<model> and emits a trails:LLMCompletion activity, plus a trails:ReActPlan root activity linked to each step. Without ctx the loop still runs, but without telemetry. Tests can omit it; production code should not.

Cost and Provenance

When ctx is threaded through, accounting is automatic:

  • One LLM call per planner step → one cost envelope and one trails:LLMCompletion activity.
  • One trails:ReActPlan root activity per run(), tagged with goal, principal, and session.id.
  • Every invoke() inside the loop still emits its own prov:Activity and cost envelope — nothing new on that side.

Open question. A capability called from a ReAct loop may itself call LLMClient.complete() (e.g., a summarize handler). Both the planner step and the capability-internal call open cost envelopes; naïve nesting double-counts the capability's LLM spend. ADR-0018 Open Question #6 flags this; until Phase 4 resolves it, handlers that do their own LLM work inside a planner session should subtract their spend from the parent envelope in reports, or avoid the overlap.

Error handling

The loop survives two failure modes cleanly. A malformed LLM reply (no JSON block, non-dict payload, missing action) earns one retry with a format-contract nudge appended to the session; a second malformed turn terminates with stopped="error" and the parse error in PlanResult.error. A TrailsError from the tool call (unknown capability id, validation failure, policy denial) becomes the step's observation ("error: <message>") and the loop continues — the LLM sees the error and can pick a different tool.

max_steps is a hard budget that counts every turn including the malformed-reply retry. On exhaustion the planner returns stopped="max_steps" with the last observation as the best-effort answer; the trajectory is still intact on result.steps. The cumulative-budget kwargs (max_cost_usd / max_tokens / max_wall_time_s) are orthogonal ceilings with matching best-effort semantics — see Budget limits and Extracting best-so-far. The only fatal path is an LLM-provider error bubbled out of LLMClient.complete — those surface as stopped="error" with the provider message.

Anti-patterns

  • Exposing every capability by default when the registry is huge. At 50+ tools the system prompt dominates cost and drowns the LLM's attention. Scope tools= to what the goal actually needs.
  • Long system prompts without Anthropic prompt caching. The ReAct prompt is a fixed framing re-sent every turn. Pair LLMClient.anthropic(cache=True) with Message(cache=True) on the system block to keep re-use free.
  • Forgetting ctx. The loop runs, but cost and provenance go silent — the worst failure mode. Forward ctx from the outer handler.
  • New Session per turn when you meant to chain. Rebuilding the session drops history; reuse the instance.

Reference

trails.agent

Symbol One-line summary
Session(*, principal, max_tokens=32_000, pin_head=1, auth=None, card=None) Per-run state: history, invocations, auth, WoT card.
TokenWindow(*, max_tokens=32_000, pin_head=1) FIFO window with pinned head and running-total eviction.
PlanningStrategy Runtime-checkable Protocol every planner satisfies.
PlanResult(answer, steps, stopped, error=None, strategy="react") Terminal output of a planner run.
PlanStep(thought, action, action_input={}, observation=None, activity_iri=None) One think-act-observe turn.
react The ReAct planner module — entry point is react.run(...).
plan_and_execute The Plan-and-Execute planner module — entry point is plan_and_execute.run(...).
reflexion The Reflexion planner module — entry point is reflexion.run(...).

trails.agent.planners

Symbol One-line summary
react.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, tool_filter=None, max_tokens_per_step=1024, temperature=0.0, max_cost_usd=None, max_tokens=None, max_wall_time_s=None) Run the ReAct loop until finish, max_steps, a budget breach, or a fatal error.
plan_and_execute.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, tool_filter=None, max_replans=3, max_tokens_per_call=1024, temperature=0.0, max_cost_usd=None, max_tokens=None, max_wall_time_s=None) Plan once, execute each step; replan on TrailsError up to max_replans times.
reflexion.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, tool_filter=None, max_outer_iterations=3, max_tokens_per_call=1024, temperature=0.0, max_cost_usd=None, max_tokens=None, max_wall_time_s=None) Run ReAct under a critic; on retry verdict, append critique and re-run up to max_outer_iterations times.
react.discover_tools(tools=None, *, tool_filter=None) List (capability_id, metadata) pairs, either from an explicit list / callable / int-N / the live registry, optionally post-filtered.
react.build_system_prompt(goal, tool_specs) Assemble the ReAct system prompt (format contract + goal + tool list).
react.parse_step(text) Extract the JSON turn payload from a raw LLM reply; tolerates surrounding prose.
PlanningStrategy / PlanResult / PlanStep Re-exported from trails.agent.planners.base.
StopReason Alias for str: one of "goal_achieved", "max_steps", "max_cost", "max_tokens", "max_wall_time", "error".

Persistent Memory

Agent sessions are ephemeral by default — knowledge is lost when the session ends. For persistent, cross-session knowledge that survives across tools and agents, use Agent Memory (guide, ADR-0051):

# Agent persists a conclusion to shared memory
trails.invoke("memory.learn", {
    "content": "The SPARQL proxy rejects SERVICE queries",
    "confidence": 0.99,
    "topic": "security",
    "agent_did": session.auth.did,
    "scope": "shared",
})

# Another agent (or same agent, later session) recalls it
trails.invoke("memory.recall", {
    "context": "SPARQL security",
    "scope": "shared",
})

Memory complements sessions: sessions hold conversation context; memory holds learned knowledge. The two integrate naturally — an agent can persist its high-confidence conclusions to memory at session end.

See also: the LLM Client & Session guide for the client and session primitives the loop is built on, the Capabilities guide for the dispatch surface every action flows through, the Agent Memory guide for persistent cross-agent knowledge, and ADR-0018 §Phased delivery for what Phase 3+ will add under the same planner protocol.