Agentic Patterns — ReAct vs Plan-and-Execute vs Reflexion

This guide is comparative. For the per-strategy reference (prompts, signatures, PROV shape, error paths), read the Agent Runtime guide. This document asks the follow-up question, "which one should I reach for, given a concrete goal?", and answers it head-to-head.

The three planners are drop-in compatible: same run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, ...) signature, same PlanResult shape, same kernel dispatch underneath (ADR-0018). Switching one for another is a one-line import change. The interesting choice is not "can I use X?" but "what does X cost me, and when does it disappoint?"

TL;DR decision tree

How is "goal met?" judged?
├── step-by-step, next action depends on last observation  →  ReAct
├── end-to-end, one upfront decomposition is obvious       →  Plan-and-Execute
└── end-to-end, correctness needs a second pass            →  Reflexion

Single LLM call enough to produce the answer?  Don't use a planner —
just call LLMClient.complete and save the overhead.
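The tree above can be written down as a small function. This is a minimal sketch, not part of the library: `choose_planner` and its three boolean flags are hypothetical names invented here to encode the questions the tree asks.

```python
from typing import Literal

Planner = Literal["none", "react", "plan_and_execute", "reflexion"]


def choose_planner(
    *,
    single_call_enough: bool,
    next_action_depends_on_observation: bool,
    needs_second_pass: bool,
) -> Planner:
    """Encode the decision tree; all names here are illustrative."""
    if single_call_enough:
        # One LLM call produces the answer: skip the planner entirely.
        return "none"
    if next_action_depends_on_observation:
        # "Goal met?" is judged step by step.
        return "react"
    if needs_second_pass:
        # Judged end to end, and correctness needs a critic.
        return "reflexion"
    # Judged end to end, and one upfront decomposition is obvious.
    return "plan_and_execute"
```

The order of the checks matters: the "no planner" case short-circuits first, so a trivially answerable goal never pays planner overhead even if the other flags are set.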

Side-by-side comparison

| Axis | ReAct | Plan-and-Execute | Reflexion |
| --- | --- | --- | --- |
| LLM calls | N (one per step) | 1 + R plan/replan calls + N dispatches | C * (N_inner + 1 critic) |
| Latency profile | Sequential thinking; steady drip | Front-loaded (one big plan), then fast | Bursty: each outer iter is a ReAct run plus critic |
| Replay-friendliness | Good: step trajectory is linear | Best: plans persisted to session.metadata["plans"] | Good: critiques persisted to session.metadata["critiques"] |
| Dominant failure mode | Step drift: LLM forgets the goal over many turns | Brittle plan: committing to a wrong decomposition wastes the whole plan before replan | Critic rubber-stamping: cheap "accept" that hides a bad answer |
| Shines when | Search-then-summarize, classify-then-route, small tool set | 3–6 independent sub-tasks with clear order | Rubric answers, soft constraints, silently-wrong is expensive |
| Disappoints when | Long horizons, branchy workflows | Every step depends on the prior observation | Goal is obvious and correct on first try (pure overhead) |

Reading this table. The three planners do not dominate each other on cost, latency, or correctness — they trade axes. ReAct pays in LLM calls to avoid committing; Plan-and-Execute pays in plan brittleness to save LLM calls; Reflexion pays a ~2x multiplier to buy correctness it cannot otherwise verify.
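The "LLM calls" row can be turned into arithmetic. The sketch below assumes only the formulas from the table (N, 1 + R, and C * (N_inner + 1)); the function names are made up for illustration, and dispatches are left out because they are not LLM calls.

```python
def react_llm_calls(n_steps: int) -> int:
    """ReAct: one LLM call per step."""
    return n_steps


def plan_and_execute_llm_calls(replans: int) -> int:
    """Plan-and-Execute: one plan call plus one call per replan."""
    return 1 + replans


def reflexion_llm_calls(outer_iters: int, n_inner: int) -> int:
    """Reflexion: each outer iteration is an inner run plus one critic call."""
    return outer_iters * (n_inner + 1)


# A three-step goal, with and without one Reflexion retry.
counts = {
    "react": react_llm_calls(3),
    "plan_and_execute": plan_and_execute_llm_calls(0),
    "reflexion_1_iter": reflexion_llm_calls(1, 3),
    "reflexion_2_iters": reflexion_llm_calls(2, 3),
}
```

Call counts are only a proxy for cost: a single plan call is typically larger than a single ReAct step, so token totals can rank the planners differently.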

Worked example — same goal, three planners

Goal. "Find requirements with no test coverage and propose new test cases for them." Assume two capabilities in the registry: reqs.list_uncovered and tests.propose (both pure dispatch).

ReAct trajectory

from trails.agent.planners import react
# turn 1: {"action": "reqs.list_uncovered", "action_input": {}}
# turn 2: {"action": "tests.propose", "action_input": {"req_ids": [...from turn 1...]}}
# turn 3: {"action": "finish", "final_answer": "3 proposals for REQ-7, REQ-12, REQ-18"}
result = react.run(goal, llm=client, session=session)

Three LLM calls; the model sees each observation before picking the next action. Natural fit when turn 2's input literally comes from turn 1's output.
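To make the observe-then-act dependency concrete, here is a self-contained toy loop that does not use the trails package at all. Everything in it (the fake tools, `scripted_llm`, `react_loop`) is a hypothetical stand-in; only the turn shapes are taken from the comments above.

```python
import json


def list_uncovered(_args: dict) -> dict:
    # Fake tool: pretend three requirements lack coverage.
    return {"req_ids": ["REQ-7", "REQ-12", "REQ-18"]}


def propose(args: dict) -> dict:
    # Fake tool: one proposal per requirement id it is given.
    return {"proposals": [f"test for {r}" for r in args["req_ids"]]}


TOOLS = {"reqs.list_uncovered": list_uncovered, "tests.propose": propose}


def scripted_llm(history: list) -> str:
    """Pretend model: picks the next turn from the observations so far."""
    if not history:
        return json.dumps({"action": "reqs.list_uncovered", "action_input": {}})
    if len(history) == 1:
        ids = history[0]["observation"]["req_ids"]
        return json.dumps({"action": "tests.propose", "action_input": {"req_ids": ids}})
    count = len(history[1]["observation"]["proposals"])
    return json.dumps({"action": "finish", "final_answer": f"{count} proposals"})


def react_loop(llm, tools, max_steps: int = 10):
    history, llm_calls = [], 0
    for _ in range(max_steps):
        turn = json.loads(llm(history))  # one LLM call per step
        llm_calls += 1
        if turn["action"] == "finish":
            return turn["final_answer"], llm_calls
        observation = tools[turn["action"]](turn["action_input"])
        history.append({"action": turn["action"], "observation": observation})
    return None, llm_calls  # step budget exhausted


answer, calls = react_loop(scripted_llm, TOOLS)
```

The second turn cannot be written until the first observation exists, which is exactly the situation where committing to a plan upfront buys nothing.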

Plan-and-Execute trajectory

from trails.agent.planners import plan_and_execute
# plan (one LLM call): [reqs.list_uncovered, tests.propose, finish]
# execute: three dispatches, no extra LLM
result = plan_and_execute.run(goal, llm=client, session=session)

One planning call plus the dispatches. Works because the plan shape is obvious before seeing data. If reqs.list_uncovered returned a surprising payload (e.g., 200 uncovered requirements → tests.propose needs batching), the executor triggers _request_plan() again, and session.metadata["plans"] grows by one.
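The replan path can be illustrated with a toy executor. This is a sketch under stated assumptions, not the library's implementation: `ToolError`, the fake tools, the 50-item limit, and `request_plan` are all invented here, and the real executor's trigger and bookkeeping may differ.

```python
class ToolError(Exception):
    """Stand-in for a dispatch error that triggers a replan."""


def list_uncovered(args: dict, state: dict) -> list:
    # Pretend the registry reports 200 uncovered requirements.
    return [f"REQ-{i}" for i in range(200)]


def propose(args: dict, state: dict) -> list:
    ids = state["reqs.list_uncovered"]
    size = max(1, args.get("batch_size", len(ids)))
    if size > 50:  # hypothetical per-call limit
        raise ToolError("too many requirements in one call; batch them")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


TOOLS = {"reqs.list_uncovered": list_uncovered, "tests.propose": propose}


def request_plan(error):
    """Pretend planner: the second plan reacts to the reported error."""
    if error is None:
        return [("reqs.list_uncovered", {}), ("tests.propose", {})]
    return [("tests.propose", {"batch_size": 50})]


def plan_and_execute(request_plan, tools, max_replans: int = 3):
    plans, state, error = [], {}, None
    for _ in range(1 + max_replans):
        plan = request_plan(error)  # one LLM call per plan
        plans.append(plan)          # mirrors the growing list of plans
        try:
            for action, args in plan:
                state[action] = tools[action](args, state)
            return state, plans
        except ToolError as exc:
            error = str(exc)
    raise RuntimeError("replan budget exhausted")


state, plans = plan_and_execute(request_plan, TOOLS)
```

In this toy run the first plan fails at its second step, so two plans are recorded and the requirements end up split into four batches of 50.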

Reflexion trajectory

from trails.agent.planners import reflexion
# outer iter 1: inner ReAct runs (3 LLM calls) → candidate answer
#               critic call → {"verdict": "retry", "critique": "no coverage % cited"}
# outer iter 2: inner ReAct re-runs with critique in history → stronger answer
#               critic call → {"verdict": "accept"}
result = reflexion.run(goal, llm=client, session=session)

Roughly 2x ReAct's calls when the first attempt needs a rewrite. The payoff: the final answer survives a second look. Skip Reflexion when "3 proposals for REQ-7/12/18" is self-evidently on-goal; in that case the critic turn is wasted.
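The outer loop is easy to picture in isolation. The following is a minimal sketch with hypothetical names (`reflexion_loop`, `inner_run`, `critic`); it reuses only the verdict shape shown in the comments above and makes no claim about how reflexion.run is actually written.

```python
def inner_run(critiques: list) -> str:
    """Pretend inner ReAct run: improves once it has seen a critique."""
    answer = "3 proposals for REQ-7, REQ-12, REQ-18"
    if critiques:
        answer += " (coverage % cited)"
    return answer


def critic(answer: str) -> dict:
    """Pretend critic: accepts only answers that mention coverage."""
    if "coverage" in answer:
        return {"verdict": "accept"}
    return {"verdict": "retry", "critique": "no coverage % cited"}


def reflexion_loop(inner_run, critic, max_outer_iterations: int = 3):
    critiques, answer = [], None
    for _ in range(max_outer_iterations):
        answer = inner_run(critiques)       # full inner run per iteration
        verdict = critic(answer)            # plus one critic call
        if verdict["verdict"] == "accept":
            return answer, critiques, True
        critiques.append(verdict["critique"])
    return answer, critiques, False         # critic never accepted


answer, critiques, accepted = reflexion_loop(inner_run, critic)
```

A critic that always returns "accept" makes this loop indistinguishable from a single ReAct run with one extra call, which is the rubber-stamping failure mode in miniature.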

| Planner | LLM calls (happy path) | Steps on PlanResult | Quality posture |
| --- | --- | --- | --- |
| ReAct | 3 | 3 PlanSteps | Correct by construction when tool errors don't fire |
| Plan-and-Execute | 1 | 3 PlanSteps | Correct if the upfront plan is correct |
| Reflexion | 4 (1 iter) or ~8 (2 iters) | 3 or 6 PlanSteps | Correct after passing a critic |

Mock trajectories above mirror the scripted turns in python/tests/test_react_planner.py, test_plan_execute_planner.py, and test_reflexion_planner.py — the _fence() helpers and _scripted_client / _dual_client fixtures there are the runnable versions.

Composition

Reflexion already wraps ReAct; that is its whole job (reflexion.run calls react.run once per outer iteration; see python/src/trails/agent/planners/reflexion.py line 349). The useful additional composition is Plan-and-Execute wrapping Reflexion: run Reflexion over a single planned sub-task, and treat its critique as a signal to replan the whole top-level plan.

@capability(id="meta.reflex_subtask")
def reflex_subtask(ctx, goal: str) -> dict:
    r = reflexion.run(goal, llm=ctx.llm, session=Session(principal=ctx.principal))
    return {"answer": r.answer, "stopped": r.stopped}
# P&E plan step: {"action": "meta.reflex_subtask", "action_input": {"goal": "..."}}

P&E's replan trigger (TrailsError from dispatch) does not fire on a merely-poor-but-non-error answer; a thin wrapper that raises TrailsError when the Reflexion result carries stopped="max_steps" escalates "critic never accepted" into "replan the outer plan."
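The thin wrapper described here might look like the sketch below. It is self-contained on purpose: `TrailsError` is redefined locally as a stand-in (in real code it would be imported from the trails package), and `FakeResult`, `escalate_unaccepted`, and the "finished" value are hypothetical.

```python
from dataclasses import dataclass


class TrailsError(Exception):
    """Local stand-in for the library's error type (assumption)."""


@dataclass
class FakeResult:
    """Minimal stand-in exposing the two fields the capability returns."""
    answer: str
    stopped: str


def escalate_unaccepted(result) -> dict:
    """Turn 'critic never accepted' into an error the outer plan can see."""
    if result.stopped == "max_steps":
        raise TrailsError("Reflexion critic never accepted; replan requested")
    return {"answer": result.answer, "stopped": result.stopped}


# "finished" is a placeholder for any stop reason other than "max_steps".
ok = escalate_unaccepted(FakeResult(answer="3 proposals", stopped="finished"))

try:
    escalate_unaccepted(FakeResult(answer="weak draft", stopped="max_steps"))
    escalated = False
except TrailsError:
    escalated = True
```

Inside the capability, the return statement would be replaced by a call to such a wrapper, so the outer executor sees a dispatch error instead of a quietly poor answer.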

Hybrid: when none fits

Planners are not free. Each one adds a format-contract prompt, an outer loop, and a retry rung. For tasks that boil down to one templated LLM call plus one kernel call, a hand-written loop built directly on LLMClient.complete is often simpler, cheaper, and easier to audit:

text = llm.complete([Message(role="user", content=prompt)], ctx=ctx).text
envelope = trails.invoke("notes.tag", {"id": note_id, "tag": text.strip()},
                         principal="did:local:alice")

The rule: reach for a planner when you genuinely don't know the next action in advance. If you do know, the planner is overhead that buys nothing. The LLM guide covers the client-level primitives on their own for exactly this case.

Cost analysis

With the CostScope dedup wiring in commit 3e99980, planner-step LLM tokens are the authoritative billing row; nested capability-internal LLMClient.complete calls inherit the scope and land as dedupe="child" in CostTracker.records() but are excluded from totals. That means cost comparisons are clean:

  • ReAct. N step calls × step tokens ≈ total. No critic, no plan call. Empirically the cheapest for goals that fit under max_steps.
  • Plan-and-Execute. 1 plan call (large, holds the full tool description and observation history) + 0 step LLM calls of its own + up to max_replans (default 3) additional plan calls. For long, clean trajectories this beats ReAct because the plan prompt is billed once, not N times. Under frequent replans its cost silently overtakes ReAct's.
  • Reflexion. C × N_inner inner step calls + C critic calls, where C ≤ max_outer_iterations. Practical budget: 2–3x ReAct on non-trivial goals that actually retry. The critic prompt is small and cacheable, so the multiplier is dominated by re-running the inner ReAct loop.

Rule of thumb: Plan-and-Execute wins when the number of replans actually used stays near 0; ReAct wins when plans would be thrown away; Reflexion wins only when answer quality is worth >2x the cheapest option.
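The break-even point between ReAct and Plan-and-Execute can be estimated with a few lines of arithmetic. This sketch makes two simplifying assumptions that are not from the source: every ReAct step costs the same number of tokens, and every plan or replan call costs the same number of tokens. The function names and the token figures are illustrative only.

```python
def react_tokens(n_steps: int, step_tokens: int) -> int:
    """ReAct: N step calls at a uniform per-step cost."""
    return n_steps * step_tokens


def plan_and_execute_tokens(replans: int, plan_tokens: int) -> int:
    """Plan-and-Execute: 1 plan call plus one more per replan."""
    return (1 + replans) * plan_tokens


def max_replans_before_overtake(n_steps: int, step_tokens: int, plan_tokens: int) -> int:
    """Largest replan count at which P&E is still strictly cheaper.

    Returns -1 when even the first plan costs at least as much as ReAct.
    """
    budget = react_tokens(n_steps, step_tokens)
    # (1 + r) * plan_tokens < budget  <=>  r <= (budget - 1) // plan_tokens - 1
    return (budget - 1) // plan_tokens - 1


# Made-up numbers: 6 ReAct steps at 1500 tokens vs. 3000-token plan calls.
limit = max_replans_before_overtake(n_steps=6, step_tokens=1500, plan_tokens=3000)
```

With these made-up figures, Plan-and-Execute stays cheaper through one replan and ties ReAct at two, well inside the default max_replans of 3, which is why the overtaking is easy to miss.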

Reference — run(...) signatures side-by-side

react.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None,
    max_tokens_per_step=1024, temperature=0.0)

plan_and_execute.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None,
    max_replans=3, max_tokens_per_call=1024, temperature=0.0)

reflexion.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None,
    max_outer_iterations=3, max_tokens_per_call=1024, temperature=0.0)

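The only strategy-specific kwargs are max_replans (P&E) and max_outer_iterations (Reflexion). Note that, as listed above, the token-limit kwarg is spelled max_tokens_per_step on react.run and max_tokens_per_call on the other two. Everything else is uniform by design so A/B harnesses swap strategies without touching call sites.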
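An A/B harness built on the shared keyword arguments could be as small as the sketch below. The three `fake_*` functions are stubs that only copy the signatures shown above so the example runs without the trails package; `PLANNERS` and `run_ab` are hypothetical names.

```python
def fake_react(goal, *, llm, session, ctx=None, max_steps=10, tools=None):
    return {"planner": "react", "goal": goal, "max_steps": max_steps}


def fake_plan_and_execute(goal, *, llm, session, ctx=None, max_steps=10,
                          tools=None, max_replans=3):
    return {"planner": "plan_and_execute", "goal": goal, "max_steps": max_steps}


def fake_reflexion(goal, *, llm, session, ctx=None, max_steps=10,
                   tools=None, max_outer_iterations=3):
    return {"planner": "reflexion", "goal": goal, "max_steps": max_steps}


# In real code these would be react.run, plan_and_execute.run, reflexion.run.
PLANNERS = {
    "react": fake_react,
    "plan_and_execute": fake_plan_and_execute,
    "reflexion": fake_reflexion,
}


def run_ab(goal, *, llm, session, strategies=tuple(PLANNERS), **shared_kwargs):
    """Run the same goal through each strategy with identical shared kwargs."""
    return {
        name: PLANNERS[name](goal, llm=llm, session=session, **shared_kwargs)
        for name in strategies
    }


results = run_ab("demo goal", llm=None, session=None, max_steps=5)
```

Only kwargs common to all three signatures should go through `shared_kwargs`; strategy-specific ones such as max_replans would raise a TypeError on the planners that do not accept them.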

See also