Agent Runtime¶
trails.agent is the framework-owned planning loop. @capability +
invoke() is pure dispatch; a planner turns a natural-language goal
into a trajectory of dispatches. M9 Phase 3 ships three strategies —
ReAct, Plan-and-Execute, and Reflexion
(ADR-0018) — behind the
PlanningStrategy protocol so swapping one for another is a one-line
import change. The planner is a thin orchestrator: it schedules
invoke(), does not replace it, and every step lands in the prov:
graph and cost envelope the kernel owns. See the LLM Client &
Session guide for the primitives this loop is built on.
Quickstart¶
from trails import capability
from trails.agent import Session
from trails.agent.planners import react
from trails.llm import LLMClient
@capability(id="notes.search", description="Search notes by tag.")
def search(ctx, tag: str) -> dict:
return {"hits": ["n1", "n2"]}
# Pick any provider — same API:
client = LLMClient.ollama(model="qwen3:8b") # local, free
# client = LLMClient.anthropic(model="claude-sonnet-4-5") # cloud
# client = LLMClient.mock(response="finish") # tests
sess = Session(principal="did:local:alice")
result = react.run("Find urgent notes.", llm=client, session=sess)
print(result.answer, "in", len(result.steps), "steps")
No tool list is passed — the planner walks the live @capability
registry. Pass tools=[...] to scope the set, and ctx=ctx to get
automatic cost attribution and PROV-O linkage.
Planning strategies¶
All three strategies share the same core signature — run(goal, *,
llm, session, ctx=None, max_steps=10, tools=None, tool_filter=None,
max_cost_usd=None, max_tokens=None, max_wall_time_s=None, ...) — so
swapping one for another is a one-line import change. Each has a
handful of strategy-specific kwargs on top (max_replans,
max_outer_iterations) documented below. Budget kwargs
(max_cost_usd / max_tokens / max_wall_time_s) are detailed in
Budget limits; the tools= / tool_filter=
scoping story is in Filtering the tool catalogue.
Pick by trajectory shape. Use ReAct when the next step depends
on the last observation and the task fits under max_steps
(search-then-summarize, classify-then-route). Use Plan-and-Execute
when a clear decomposition exists up front and you want one reasoning
turn to cover many actions — it is cheaper per step on long-horizon
workflows and fails more gracefully on individually-checkable sub-tasks.
Use Reflexion when goal correctness is hard to verify in a single
pass (rubric-matching, soft constraints, answer quality) and a
silently-wrong answer is expensive enough to pay for a critic turn.
ReAct — trails.agent.planners.react¶
Think / act / observe. On every turn the LLM emits a JSON block
containing thought, action, and action_input; the planner
dispatches the action through trails.invoke, feeds the observation
back as the next user message, and loops. action == "finish"
terminates with the final_answer field as PlanResult.answer. Tool
errors (TrailsError from dispatch) are not fatal — they become the
step's observation so the LLM can self-correct.
When to use. Simple goals where the next step depends on the last
observation: search-then-summarize, classify-then-route,
read-then-decide. ReAct is the default when the task fits under
max_steps (default 10) and the LLM can reason about intermediate
results without a global plan. Not for long-horizon trajectories
(> 10 steps) or branch-heavy workflows — prefer Plan-and-Execute
once it lands.
Plan-and-Execute — trails.agent.planners.plan_and_execute¶
Two explicit phases. Plan: one LLM call emits a fenced
```json block holding an ordered plan list whose entries
carry action, action_input, and rationale, and which must
terminate with an action == "finish" step. Execute: the kernel
runs each step through trails.invoke. Replanning is triggered when a
step returns a TrailsError (unknown id, validation failure, policy
denial) — the loop appends the observation to the session, asks the
LLM for a revised plan given what already happened, and resumes.
max_replans (default 3) caps how often that may happen; max_steps
remains the hard ceiling on executed actions across all replans. A
single malformed plan earns one nudge-retry; a second failure
terminates with stopped="error". Each parsed plan is appended to
session.metadata["plans"] for post-hoc inspection.
Signature.
plan_and_execute.run(goal, *, llm, session, ctx=None, max_steps=10,
tools=None, tool_filter=None, max_replans=3,
max_tokens_per_call=1024, temperature=0.0,
max_cost_usd=None, max_tokens=None, max_wall_time_s=None)
When to use. Long-horizon multi-step workflows with a clear decomposition the LLM can emit in one shot ("ingest 10 files, validate each, export deltas"). Tasks where each step's success is individually checkable, and where one upfront reasoning turn amortises better than a think/act turn per step.
Worked example.
import json
from trails import capability
from trails.agent import Session
from trails.agent.planners import plan_and_execute
from trails.llm import LLMClient
@capability(id="notes.search", description="Search notes by tag.")
def search(ctx, tag: str) -> dict:
return {"hits": ["n1", "n2"]}
plan = {"plan": [
{"action": "notes.search", "action_input": {"tag": "urgent"}, "rationale": "list"},
{"action": "finish", "action_input": {}, "final_answer": "2 urgent notes."},
]}
fenced = "```json\n" + json.dumps(plan) + "\n```"
client = LLMClient.mock(response=lambda _m: fenced)
result = plan_and_execute.run("Find urgent notes.", llm=client, session=Session(principal="did:local:alice"))
See internal planning and ADR-0018 Phase 3 for the design rationale.
Reflexion — trails.agent.planners.reflexion¶
A ReAct inner loop under a Critic outer loop. Each outer iteration
runs react.run(...) once with the supplied max_steps; the resulting
PlanResult is handed to a critic LLM call built on a fresh message
list (so the critic's own prompt does not pollute the agent chat). The
critic replies with a fenced ```json block
{"verdict": "accept" | "retry", "critique": "..."}. On "accept"
the loop terminates with stopped="goal_achieved"; on "retry" the
critique is appended to the session as a user message tagged
[reflexion critique #<n>] so the next inner ReAct pass reads it as
chat history and can avoid the named defect. Outer iteration is
capped by max_outer_iterations (default 3); exhaustion returns the
latest inner result with stopped="max_steps", matching ReAct's
step-budget semantics so A/B comparisons stay uniform. The ordered
list of critic replies lives on session.metadata["critiques"]. A
malformed critic reply earns one nudge-retry; a second failure
terminates with stopped="error". Inner errors propagate unchanged —
no amount of reflection fixes a dispatch crash.
Signature.
reflexion.run(goal, *, llm, session, ctx=None, max_steps=10,
tools=None, tool_filter=None, max_outer_iterations=3,
max_tokens_per_call=1024, temperature=0.0,
max_cost_usd=None, max_tokens=None, max_wall_time_s=None)
When to use. Goal correctness is hard to judge in one pass — answer quality, rubric-matching, soft constraints, tasks where self-critique empirically improves output. The overhead of a critic turn pays off when a silently-wrong answer is expensive.
Worked example.
from trails.agent import Session
from trails.agent.planners import reflexion
from trails.llm import LLMClient, Message
inner = '```json\n{"thought":"done","action":"finish","action_input":{},"final_answer":"ok"}\n```'
critic = '```json\n{"verdict":"accept","critique":""}\n```'
def route(msgs: list[Message]) -> str:
sys = next((m.content for m in msgs if m.role == "system"), "")
return critic if "Reflexion critic" in sys else inner
client = LLMClient.mock(response=route)
result = reflexion.run("Answer the goal.", llm=client, session=Session(principal="did:local:alice"))
See internal planning and ADR-0018 Phase 3 for the design rationale.
Tool discovery¶
Without a tools= kwarg, react.run walks the live
trails.decorators._handlers registry and exposes every registered
@capability. Each tool entry in the system prompt carries its
description, parameter names, type annotations (stringified), and
required/optional flag — enough for the LLM to fill in action_input
without a manifest round-trip. Pass an explicit list to scope:
or a callable (re-evaluated once at run start):
Unknown ids are surfaced to the prompt as (unregistered) rather than
silently dropped — the LLM sees the gap and can reason about it.
Bloat concern. The prompt grows linearly with tool count. Around
50+ capabilities it dominates the input token bill and the LLM's
attention budget; scope tools= to what the goal needs (static list
per route, or a retrieval step before the loop).
Filtering the tool catalogue¶
tools= accepts four shapes (ToolsSpec in
trails.agent.planners.react): None = discover every registered
capability; list[str] = use exactly those ids; Callable[[],
list[str]] = zero-arg callable resolved once at run() entry;
int N = keep the N most-recently-registered capabilities (tail of
the insertion-ordered _handlers dict). bool is explicitly rejected
to prevent the True/False → keep-one-or-zero footgun.
# Explicit static list — deterministic tests, lean prompt.
react.run(goal, llm=c, session=s, tools=["notes.search", "notes.tag"])
# Top-N most recently registered — handy when plugins append tools.
react.run(goal, llm=c, session=s, tools=5)
tool_filter= is a second-stage post-filter applied to the list
produced by tools=. It accepts:
None— no filter.Callable[[dict], bool]— predicate on the per-tool metadata dict (id,description,params,required,annotations,input_shape,output_shape); returnTrueto keep.str— whitespace-separated keywords; a tool is kept iff itsidordescriptioncontains at least one keyword (case-insensitive substring). An empty string passes everything.
# Keyword pre-filter driven by goal text.
react.run(goal, llm=c, session=s, tool_filter="notes search tag")
# Arbitrary predicate — e.g. keep read-only tools only.
react.run(
goal, llm=c, session=s,
tool_filter=lambda meta: "read-only" in meta["annotations"].get("tags", ""),
)
Why filter. Two wins:
- Lean prompt. Every tool entry costs input tokens every turn.
At ~30 discovered capabilities
discover_toolsemits aUserWarningnudging you to filter; at 50+ the catalogue can dominate both cost and the LLM's attention. Scoping to the ~handful a goal actually needs is the single biggest lever on planner cost. - Deterministic tests. A fixed
tools=[...]list shields a test from plugin-order drift and from unrelated capabilities registered by conftest fixtures. The planner sees exactly what the test chose, nothing more.
Both kwargs are available on all three planners (react.run,
plan_and_execute.run, reflexion.run) with identical semantics.
Budget limits¶
Three cumulative-budget kwargs cap a planner run without depending on
max_steps. All three are None (= unlimited) by default and are
available on react.run, plan_and_execute.run, and reflexion.run:
| Kwarg | Breach → PlanResult.stopped |
Sampled from |
|---|---|---|
max_cost_usd: float \| None |
"max_cost" |
cost_tracker.total_usd() |
max_tokens: int \| None |
"max_tokens" |
cost_tracker.total_tokens() (prompt + completion) |
max_wall_time_s: float \| None |
"max_wall_time" |
time.monotonic() since run() entry |
All three read from the shared CostTracker resolved via
trails.llm._module_tracker() — the same tracker every LLMClient.complete
books into, so caps are authoritative across the planner itself and
any capability-internal LLM calls booked during the run. Child rows
that dedupe="child" against a scope's call_id are excluded from
total_usd() / total_tokens(), so a nested LLM call isn't
double-counted against the budget.
When the check runs. At iteration boundaries, not mid-step:
react.run checks before each turn's LLM call; plan_and_execute.run
checks at the top of every planned step and before paying for a
replan; reflexion.run checks at the top of each outer iteration AND
threads the remaining wall-time into the inner ReAct so a long inner
run aborts mid-iteration. Semantics: the in-flight operation finishes,
the loop aborts, a best-effort trails:wasTerminatedBy triple is
written to the prov: graph (logged-and-swallowed on failure), and a
PlanResult is returned.
Result shape on abort.
PlanResult.stoppedcarries the breach reason ("max_cost"/"max_tokens"/"max_wall_time") rather than"goal_achieved".PlanResult.stepsis the trajectory recorded up to the abort — all prior observations are intact.PlanResult.answeris not a final answer. ReAct and Plan-and-Execute stringify the last observation (JSON-encoded when it isn't already astr, viajson.dumps(..., default=str)); Reflexion returns the last inner ReAct run'sanswerunchanged.PlanResult.errorstaysNone— a budget breach is a clean stop, not an error.
There is no separate warning field on PlanResult; callers inspect
stopped to detect an abort. discover_tools emits a UserWarning
(Python warnings) when auto-discovery exceeds the soft cap of 30
capabilities and no tool_filter is set — that is the only
warning-channel surface tied to this loop.
Extracting best-so-far¶
Because answer on a budget-abort is the last observation rather than
a typed final answer, apps that need structured output on abort must
extract it themselves. The canonical recipe:
from trails.agent.planners import react
result = react.run(
goal,
llm=c,
session=s,
max_cost_usd=0.25,
max_wall_time_s=30.0,
)
if result.stopped == "goal_achieved":
final = result.answer # trusted final answer
elif result.stopped in {"max_cost", "max_tokens", "max_wall_time"}:
# Best-so-far: walk the trajectory tail-first for the most recent
# non-error, non-"finish" observation and parse it.
best = None
for step in reversed(result.steps):
obs = step.observation
if obs is None or (isinstance(obs, str) and obs.startswith("error:")):
continue
best = obs
break
# ``result.answer`` is the same observation stringified — cheapest
# fallback when a dict isn't required.
final = best if best is not None else result.answer
else: # "error"
raise RuntimeError(result.error)
Walking result.steps wins over parsing result.answer whenever the
caller needs the raw observation dict (for schema validation, partial
extraction, etc.) instead of its JSON stringification. For
Plan-and-Execute, the same pattern applies; additionally,
session.metadata["plans"] holds every parsed plan in order, so a
caller can recover the un-executed tail of the final plan if it needs
to resume later. For Reflexion, session.metadata["critiques"] holds
the ordered critic replies — useful context for a follow-up run that
starts fresh under a raised budget.
Session and Context¶
Session (from trails.agent.context, covered in the
LLM guide) is the conversation-state
ledger the planner writes into: system prompt, assistant turns,
observations, plus a parallel invocations list of raw envelopes.
Reuse one Session across react.run calls to continue a
conversation; create a fresh one per independent goal.
ctx is the per-invoke trails.context.Context — the same object a
@capability handler receives. Pass ctx=ctx and every LLM call
bills through the cost tracker tagged llm:<model> and emits a
trails:LLMCompletion activity, plus a trails:ReActPlan root
activity linked to each step. Without ctx the loop still runs, but
without telemetry. Tests can omit it; production code should not.
Cost and Provenance¶
When ctx is threaded through, accounting is automatic:
- One LLM call per planner step → one cost envelope and one
trails:LLMCompletionactivity. - One
trails:ReActPlanroot activity perrun(), tagged with goal, principal, andsession.id. - Every
invoke()inside the loop still emits its ownprov:Activityand cost envelope — nothing new on that side.
Open question. A capability called from a ReAct loop may itself
call LLMClient.complete() (e.g., a summarize handler). Both the
planner step and the capability-internal call open cost envelopes;
naïve nesting double-counts the capability's LLM spend. ADR-0018 Open
Question #6 flags this; until Phase 4 resolves it, handlers that do
their own LLM work inside a planner session should subtract their
spend from the parent envelope in reports, or avoid the overlap.
Error handling¶
The loop survives two failure modes cleanly. A malformed LLM
reply (no JSON block, non-dict payload, missing action) earns one
retry with a format-contract nudge appended to the session; a second
malformed turn terminates with stopped="error" and the parse error
in PlanResult.error. A TrailsError from the tool call (unknown
capability id, validation failure, policy denial) becomes the step's
observation ("error: <message>") and the loop continues — the LLM
sees the error and can pick a different tool.
max_steps is a hard budget that counts every turn including the
malformed-reply retry. On exhaustion the planner returns
stopped="max_steps" with the last observation as the best-effort
answer; the trajectory is still intact on result.steps. The
cumulative-budget kwargs (max_cost_usd / max_tokens /
max_wall_time_s) are orthogonal ceilings with matching best-effort
semantics — see Budget limits and
Extracting best-so-far. The only fatal
path is an LLM-provider error bubbled out of LLMClient.complete —
those surface as stopped="error" with the provider message.
Anti-patterns¶
- Exposing every capability by default when the registry is huge.
At 50+ tools the system prompt dominates cost and drowns the LLM's
attention. Scope
tools=to what the goal actually needs. - Long system prompts without Anthropic prompt caching. The ReAct
prompt is a fixed framing re-sent every turn. Pair
LLMClient.anthropic(cache=True)withMessage(cache=True)on the system block to keep re-use free. - Forgetting
ctx. The loop runs, but cost and provenance go silent — the worst failure mode. Forwardctxfrom the outer handler. - New
Sessionper turn when you meant to chain. Rebuilding the session drops history; reuse the instance.
Reference¶
trails.agent¶
| Symbol | One-line summary |
|---|---|
Session(*, principal, max_tokens=32_000, pin_head=1, auth=None, card=None) |
Per-run state: history, invocations, auth, WoT card. |
TokenWindow(*, max_tokens=32_000, pin_head=1) |
FIFO window with pinned head and running-total eviction. |
PlanningStrategy |
Runtime-checkable Protocol every planner satisfies. |
PlanResult(answer, steps, stopped, error=None, strategy="react") |
Terminal output of a planner run. |
PlanStep(thought, action, action_input={}, observation=None, activity_iri=None) |
One think-act-observe turn. |
react |
The ReAct planner module — entry point is react.run(...). |
plan_and_execute |
The Plan-and-Execute planner module — entry point is plan_and_execute.run(...). |
reflexion |
The Reflexion planner module — entry point is reflexion.run(...). |
trails.agent.planners¶
| Symbol | One-line summary |
|---|---|
react.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, tool_filter=None, max_tokens_per_step=1024, temperature=0.0, max_cost_usd=None, max_tokens=None, max_wall_time_s=None) |
Run the ReAct loop until finish, max_steps, a budget breach, or a fatal error. |
plan_and_execute.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, tool_filter=None, max_replans=3, max_tokens_per_call=1024, temperature=0.0, max_cost_usd=None, max_tokens=None, max_wall_time_s=None) |
Plan once, execute each step; replan on TrailsError up to max_replans times. |
reflexion.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, tool_filter=None, max_outer_iterations=3, max_tokens_per_call=1024, temperature=0.0, max_cost_usd=None, max_tokens=None, max_wall_time_s=None) |
Run ReAct under a critic; on retry verdict, append critique and re-run up to max_outer_iterations times. |
react.discover_tools(tools=None, *, tool_filter=None) |
List (capability_id, metadata) pairs, either from an explicit list / callable / int-N / the live registry, optionally post-filtered. |
react.build_system_prompt(goal, tool_specs) |
Assemble the ReAct system prompt (format contract + goal + tool list). |
react.parse_step(text) |
Extract the JSON turn payload from a raw LLM reply; tolerates surrounding prose. |
PlanningStrategy / PlanResult / PlanStep |
Re-exported from trails.agent.planners.base. |
StopReason |
Alias for str: one of "goal_achieved", "max_steps", "max_cost", "max_tokens", "max_wall_time", "error". |
Persistent Memory¶
Agent sessions are ephemeral by default — knowledge is lost when the session ends. For persistent, cross-session knowledge that survives across tools and agents, use Agent Memory (guide, ADR-0051):
# Agent persists a conclusion to shared memory
trails.invoke("memory.learn", {
"content": "The SPARQL proxy rejects SERVICE queries",
"confidence": 0.99,
"topic": "security",
"agent_did": session.auth.did,
"scope": "shared",
})
# Another agent (or same agent, later session) recalls it
trails.invoke("memory.recall", {
"context": "SPARQL security",
"scope": "shared",
})
Memory complements sessions: sessions hold conversation context; memory holds learned knowledge. The two integrate naturally — an agent can persist its high-confidence conclusions to memory at session end.
See also: the LLM Client & Session guide for the client and session primitives the loop is built on, the Capabilities guide for the dispatch surface every action flows through, the Agent Memory guide for persistent cross-agent knowledge, and ADR-0018 §Phased delivery for what Phase 3+ will add under the same planner protocol.