Agentic Patterns — ReAct vs Plan-and-Execute vs Reflexion¶
This guide is comparative. For the per-strategy reference (prompts, signatures, PROV shape, error paths), read the Agent Runtime guide. This document asks the follow-up question, "which one should I reach for, given a concrete goal?", and answers it head-to-head.
The three planners are drop-in compatible: they share the same
run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, ...)
signature, the same PlanResult shape, and the same kernel dispatch
underneath (ADR-0018). Switching one for another is a one-line import
change. The interesting choice is not "can I use X?" but "what does X
cost me, and when does it disappoint?"
TL;DR decision tree¶
How is "goal met?" judged?
├── step-by-step, next action depends on last observation → ReAct
├── end-to-end, one upfront decomposition is obvious → Plan-and-Execute
└── end-to-end, correctness needs a second pass → Reflexion
Single LLM call enough to produce the answer? Don't use a planner —
just call LLMClient.complete and save the overhead.
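The tree above can be written down as a small selection function. This is only a sketch: the `Planner` enum, the function name, and the three boolean flags are illustrative names, not part of the library.

```python
from enum import Enum


class Planner(Enum):
    NONE = "LLMClient.complete"          # no planner, a single LLM call
    REACT = "react"
    PLAN_AND_EXECUTE = "plan_and_execute"
    REFLEXION = "reflexion"


def choose_planner(*, single_call_enough: bool,
                   judged_step_by_step: bool,
                   needs_second_pass: bool) -> Planner:
    """Encode the decision tree; all flag names are hypothetical."""
    if single_call_enough:
        return Planner.NONE              # skip the planner overhead entirely
    if judged_step_by_step:
        return Planner.REACT             # next action depends on last observation
    if needs_second_pass:
        return Planner.REFLEXION         # end-to-end, but needs a critic pass
    return Planner.PLAN_AND_EXECUTE      # end-to-end, decomposition is obvious
```

The single-call check comes first on purpose: it short-circuits before any planner is considered.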
Side-by-side comparison¶
| Axis | ReAct | Plan-and-Execute | Reflexion |
|---|---|---|---|
| LLM calls | N (one per step) | 1 + R plan/replan calls + N dispatches | C * (N_inner + 1 critic) |
| Latency profile | Sequential thinking; steady drip | Front-loaded (one big plan) then fast | Bursty: each outer iter is a ReAct run plus critic |
| Replay-friendliness | Good: step trajectory is linear | Best: plans persisted to session.metadata["plans"] | Good: critiques persisted to session.metadata["critiques"] |
| Dominant failure mode | Step drift: LLM forgets the goal over many turns | Brittle plan: committing to a wrong decomposition wastes the whole plan before replan | Critic rubber-stamping: cheap "accept" that hides a bad answer |
| Shines when | Search-then-summarize, classify-then-route, small tool set | 3–6 independent sub-tasks with clear order | Rubric answers, soft constraints, silently-wrong is expensive |
| Disappoints when | Long horizons, branchy workflows | Every step depends on the prior observation | Goal is obvious and correct on first try; pure overhead |
Reading this table. The three planners do not dominate each other on cost, latency, or correctness — they trade axes. ReAct pays in LLM calls to avoid committing; Plan-and-Execute pays in plan brittleness to save LLM calls; Reflexion pays a ~2x multiplier to buy correctness it cannot otherwise verify.
Worked example — same goal, three planners¶
Goal. "Find requirements with no test coverage and propose new
test cases for them." Assume two capabilities in the registry:
reqs.list_uncovered and tests.propose (both pure dispatch).
ReAct trajectory¶
from trails.agent.planners import react
# turn 1: {"action": "reqs.list_uncovered", "action_input": {}}
# turn 2: {"action": "tests.propose", "action_input": {"req_ids": [...from turn 1...]}}
# turn 3: {"action": "finish", "final_answer": "3 proposals for REQ-7, REQ-12, REQ-18"}
result = react.run(goal, llm=client, session=session)
Plan-and-Execute trajectory¶
from trails.agent.planners import plan_and_execute
# plan (one LLM call): [reqs.list_uncovered, tests.propose, finish]
# execute: three dispatches, no extra LLM
result = plan_and_execute.run(goal, llm=client, session=session)
If reqs.list_uncovered returns a surprising payload (e.g., 200
uncovered requirements → tests.propose needs batching), the executor
triggers _request_plan() again, and session.metadata["plans"] grows
by one.
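The replan behaviour can be pictured with a toy executor. This is a hedged sketch, not the library's implementation: `DispatchError`, `execute_with_replans`, and both callables are hypothetical stand-ins, and it assumes a replan is triggered by an error raised from dispatch and that the planner sees all observations so far.

```python
from typing import Callable, List, Tuple


class DispatchError(Exception):
    """Stand-in for the error a failed dispatch raises (assumption)."""


def execute_with_replans(
    request_plan: Callable[[List[str]], List[str]],
    dispatch: Callable[[str], str],
    max_replans: int = 3,
) -> Tuple[List[str], List[List[str]]]:
    plans: List[List[str]] = []    # plays the role of session.metadata["plans"]
    observations: List[str] = []   # fed back to the planner on every (re)plan
    while True:
        plan = request_plan(observations)   # one LLM call per plan or replan
        plans.append(plan)
        try:
            for action in plan:
                observations.append(dispatch(action))   # no LLM call here
            return observations, plans
        except DispatchError as exc:
            if len(plans) > max_replans:    # initial plan + max_replans used up
                raise
            observations.append(f"error: {exc}")
```

Each failed plan costs one extra planning call, which is why frequent replans erode the cost advantage.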
Reflexion trajectory¶
from trails.agent.planners import reflexion
# outer iter 1: inner ReAct runs (3 LLM calls) → candidate answer
# critic call → {"verdict": "retry", "critique": "no coverage % cited"}
# outer iter 2: inner ReAct re-runs with critique in history → stronger answer
# critic call → {"verdict": "accept"}
result = reflexion.run(goal, llm=client, session=session)
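The outer loop sketched in the comments above has roughly the following shape. This is an illustration under assumptions: `reflexion_loop`, `inner_run`, and `critic` are hypothetical names, and the real planner persists critiques on the session rather than in a local list.

```python
from typing import Callable, Dict, List, Tuple


def reflexion_loop(
    inner_run: Callable[[List[str]], str],
    critic: Callable[[str], Dict[str, str]],
    max_outer_iterations: int = 3,
) -> Tuple[str, List[str], bool]:
    """Return (last answer, critiques collected, whether the critic accepted)."""
    critiques: List[str] = []   # stands in for session.metadata["critiques"]
    answer = ""
    for _ in range(max_outer_iterations):
        answer = inner_run(critiques)    # inner ReAct run, sees prior critiques
        review = critic(answer)          # one extra LLM call per outer iteration
        if review["verdict"] == "accept":
            return answer, critiques, True
        critiques.append(review.get("critique", ""))
    return answer, critiques, False      # critic never accepted
```

Returning the acceptance flag separately keeps "ran out of iterations" distinguishable from "accepted", which matters when a caller wants to escalate.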
| Planner | LLM calls (happy) | Steps on PlanResult | Quality posture |
|---|---|---|---|
| ReAct | 3 | 3 PlanSteps | Correct by construction when tool errors don't fire |
| Plan-and-Execute | 1 | 3 PlanSteps | Correct if the upfront plan is right |
| Reflexion | 4 (1 iter) or ~8 (2 iters) | 3 or 6 PlanSteps | Correct after passing a critic |
Mock trajectories above mirror the scripted turns in
python/tests/test_react_planner.py, test_plan_execute_planner.py,
and test_reflexion_planner.py — the _fence() helpers and
_scripted_client / _dual_client fixtures there are the
runnable versions.
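The LLM-call counts for this example follow directly from the formulas in the side-by-side table. The helpers below are a sketch for checking the arithmetic; the function names are made up for illustration.

```python
def react_llm_calls(n_steps: int) -> int:
    """ReAct: one LLM call per step."""
    return n_steps


def plan_and_execute_llm_calls(replans: int) -> int:
    """Plan-and-Execute: one plan call plus one per replan; dispatches are free."""
    return 1 + replans


def reflexion_llm_calls(outer_iters: int, inner_steps: int) -> int:
    """Reflexion: each outer iteration is an inner ReAct run plus one critic call."""
    return outer_iters * (inner_steps + 1)


# The worked example: 3 steps, no replans, 1 or 2 outer iterations.
counts = {
    "react": react_llm_calls(3),
    "plan_and_execute": plan_and_execute_llm_calls(0),
    "reflexion_1_iter": reflexion_llm_calls(1, 3),
    "reflexion_2_iters": reflexion_llm_calls(2, 3),
}
```

The two-iteration figure assumes the second inner run again takes three steps.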
Composition¶
Reflexion already wraps ReAct; that is its whole job
(reflexion.run calls react.run once per outer iteration; see
python/src/trails/agent/planners/reflexion.py line 349). The useful
additional composition is Plan-and-Execute wrapping Reflexion:
run Reflexion over a single planned sub-task and treat its critique
as a signal to replan the whole top-level plan.
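from trails.agent.planners import reflexion
# capability and Session are assumed to be imported from the trails package.

@capability(id="meta.reflex_subtask")
def reflex_subtask(ctx, goal: str) -> dict:
    # Run Reflexion in a fresh Session scoped to this sub-task.
    r = reflexion.run(goal, llm=ctx.llm, session=Session(principal=ctx.principal))
    return {"answer": r.answer, "stopped": r.stopped}
# P&E plan step: {"action": "meta.reflex_subtask", "action_input": {"goal": "..."}}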
One caveat: Plan-and-Execute's replan trigger (a TrailsError from
dispatch) does not fire on a merely-poor-but-non-error answer. A thin
wrapper that raises TrailsError when the Reflexion result carries
stopped="max_steps" escalates "critic never accepted" into "replan
the outer plan."
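A minimal sketch of that thin wrapper follows. To keep it runnable without the trails package, `TrailsError` and `FakePlanResult` are local stand-ins, and `escalate_if_unaccepted` is a hypothetical helper name; the real error class may take different constructor arguments.

```python
from dataclasses import dataclass


class TrailsError(Exception):
    """Local stand-in for the library's error type (assumption)."""


@dataclass
class FakePlanResult:
    """Stand-in carrying only the two fields the wrapper reads."""
    answer: str
    stopped: str


def escalate_if_unaccepted(result) -> dict:
    """Turn 'critic never accepted' into an error the outer plan can react to."""
    if result.stopped == "max_steps":
        raise TrailsError("reflexion critic never accepted the answer")
    return {"answer": result.answer, "stopped": result.stopped}
```

Inside the capability, the last line would then become `return escalate_if_unaccepted(r)`.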
Hybrid: when none fits¶
Planners are not free. Each one adds a format-contract prompt, an
outer loop, and a retry rung. For tasks that boil down to one
templated LLM call plus one kernel call, a hand-written loop built
directly on LLMClient.complete is often simpler, cheaper, and
easier to audit:
text = llm.complete([Message(role="user", content=prompt)], ctx=ctx).text
envelope = trails.invoke("notes.tag", {"id": note_id, "tag": text.strip()},
                         principal="did:local:alice")
The rule: reach for a planner when you genuinely don't know the next action in advance. If you do know, the planner is overhead that buys nothing. The LLM guide covers the client-level primitives on their own for exactly this case.
Cost analysis¶
With the CostScope dedup wiring in commit 3e99980, planner-step
LLM tokens are the authoritative billing row; nested
capability-internal LLMClient.complete calls inherit the scope and
land as dedupe="child" in CostTracker.records() but are excluded
from totals. That means cost comparisons are clean:
- ReAct. N step calls × step tokens ≈ total. No critic, no plan call. Empirically the cheapest for goals that fit under max_steps.
- Plan-and-Execute. 1 plan call (large, holds the full tool description and observation history) + 0 step LLM calls of its own + up to max_replans (default 3) additional plan calls. For long, clean trajectories this beats ReAct because the plan prompt is billed once, not N times. Under frequent replans it silently overtakes ReAct.
- Reflexion. C * N_inner inner tokens + C critic calls, where C ≤ max_outer_iterations. Practical budget: 2–3x ReAct on non-trivial goals that actually retry. The critic prompt is small and cacheable, so the multiplier is dominated by re-running the inner ReAct loop.
Rule of thumb: Plan-and-Execute wins when the number of replans
actually triggered stays near 0; ReAct wins when plans would be thrown
away; Reflexion wins only when answer quality is worth >2x the cost of
the cheapest option.
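The three bullets can be turned into a back-of-the-envelope token model. This is a sketch only: the function names are hypothetical, and the token sizes are made-up inputs chosen to show the shape of the trade-off, not measurements.

```python
def react_tokens(n_steps: int, step_tokens: int) -> int:
    return n_steps * step_tokens


def plan_and_execute_tokens(replans: int, plan_tokens: int) -> int:
    # One plan call plus one per replan; the executor itself makes no LLM calls.
    return (1 + replans) * plan_tokens


def reflexion_tokens(outer_iters: int, inner_steps: int,
                     step_tokens: int, critic_tokens: int) -> int:
    return outer_iters * (inner_steps * step_tokens + critic_tokens)


# Illustrative inputs (assumptions): 6 steps, 800 tokens per step,
# 2000 tokens per plan call, 300 tokens per critic call.
baseline = react_tokens(6, 800)
clean_plan = plan_and_execute_tokens(0, 2000)
thrashing_plan = plan_and_execute_tokens(3, 2000)
two_pass = reflexion_tokens(2, 6, 800, 300)
```

With these inputs the clean plan costs less than half of ReAct, three replans push Plan-and-Execute past ReAct, and two Reflexion passes land a little above 2x ReAct.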
Reference — run(...) signatures side-by-side¶
react.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None,
          max_tokens_per_step=1024, temperature=0.0)
plan_and_execute.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None,
                     max_replans=3, max_tokens_per_call=1024, temperature=0.0)
reflexion.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None,
              max_outer_iterations=3, max_tokens_per_call=1024, temperature=0.0)
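The only strategy-specific kwargs are max_replans (P&E) and
max_outer_iterations (Reflexion). One naming wrinkle in the listing
above: the token cap is max_tokens_per_step on ReAct but
max_tokens_per_call on the other two. Everything else is uniform by
design so A/B harnesses swap strategies without touching call sites.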
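A small A/B harness shows what the uniform signature buys. This is a sketch: `run_ab` and `session_factory` are hypothetical names, and the two `fake_*` planners are stand-ins so the block runs without the trails package. In real use the mapping would hold `react.run`, `plan_and_execute.run`, and `reflexion.run`.

```python
from typing import Any, Callable, Dict

PlannerRun = Callable[..., Any]


def run_ab(planners: Dict[str, PlannerRun], goal: str, *, llm: Any,
           session_factory: Callable[[], Any], **common: Any) -> Dict[str, Any]:
    """Run the same goal through every planner, passing only shared kwargs."""
    results: Dict[str, Any] = {}
    for name, run in planners.items():
        # A fresh session per planner keeps trajectories from mixing.
        results[name] = run(goal, llm=llm, session=session_factory(), **common)
    return results


# Stand-in planners that mimic the shared keyword-only signature.
def fake_react(goal, *, llm, session, max_steps=10, **_):
    return f"react:{goal}:{max_steps}"


def fake_reflexion(goal, *, llm, session, max_steps=10,
                   max_outer_iterations=3, **_):
    return f"reflexion:{goal}:{max_steps}"


results = run_ab({"react": fake_react, "reflexion": fake_reflexion},
                 "demo goal", llm=None, session_factory=dict, max_steps=5)
```

Only kwargs common to all three planners belong in `**common`; strategy-specific ones would need a per-planner override.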
See also¶
- Agent Runtime guide — per-strategy prompts, parsing, PROV-O shape, error paths.
- LLM Client & Session guide — the primitives every planner is built on; also the right choice for sub-planner tasks.
- ADR-0018 — the design source.
- Sprint: ReAct planner — Phase 2 delivery record.
- Sprint: Plan-and-Execute + Reflexion — Phase 3 delivery record.