Agentic Patterns: ReAct vs Plan-and-Execute vs Reflexion

This guide is comparative. For the per-strategy reference (prompts, signatures, PROV shape, error paths), read the Agent Runtime guide. This document asks the follow-up question, "which one should I reach for, given a concrete goal?", and answers it head-to-head.

The three planners are drop-in compatible: same run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, ...) signature, same PlanResult shape, same kernel dispatch underneath (ADR-0018). Switching one for another is a one-line import change. The interesting question is not "can I use X?" but "what does X cost me, and when does it disappoint?"

TL;DR decision tree

How is "goal met?" judged?
├── step-by-step, next action depends on last observation  →  ReAct
├── end-to-end, one upfront decomposition is obvious       →  Plan-and-Execute
└── end-to-end, correctness needs a second pass            →  Reflexion

Single LLM call enough to produce the answer?  Don't use a planner —
just call LLMClient.complete and save the overhead.
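The tree above can be written down as a small helper, which is handy when the choice has to be made programmatically (for example in a router). This is only an illustrative sketch: `choose_planner` and its three boolean parameters are hypothetical names, not part of the planner API.

```python
def choose_planner(
    *,
    single_call_enough: bool,
    next_action_depends_on_observation: bool,
    needs_second_pass: bool,
) -> str:
    """Return the strategy name the decision tree points to (hypothetical helper)."""
    if single_call_enough:
        # No planner at all: call LLMClient.complete directly.
        return "none"
    if next_action_depends_on_observation:
        # "Goal met?" is judged step by step.
        return "react"
    if needs_second_pass:
        # Judged end-to-end, but correctness needs a critic.
        return "reflexion"
    # Judged end-to-end with an obvious upfront decomposition.
    return "plan_and_execute"
```

For the worked example later in this guide, `next_action_depends_on_observation=True` selects `"react"`.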

Side-by-side comparison

| Axis | ReAct | Plan-and-Execute | Reflexion |
| --- | --- | --- | --- |
| LLM calls | `N` (one per step) | `1 + R` plan/replan calls + `N` dispatches | `C * (N_inner + 1 critic)` |
| Latency profile | Sequential thinking; steady drip | Front-loaded (one big plan), then fast | Bursty: each outer iteration is a ReAct run plus a critic call |
| Replay-friendliness | Good: step trajectory is linear | Best: plans persisted to `session.metadata["plans"]` | Good: critiques persisted to `session.metadata["critiques"]` |
| Dominant failure mode | Step drift: the LLM forgets the goal over many turns | Brittle plan: committing to a wrong decomposition wastes the whole plan before replan | Critic rubber-stamping: a cheap "accept" that hides a bad answer |
| Shines when | Search-then-summarize, classify-then-route, small tool set | 3–6 independent sub-tasks with clear order | Rubric answers, soft constraints, silently-wrong is expensive |
| Disappoints when | Long horizons, branchy workflows | Every step depends on the prior observation | Goal is obvious and correct on first try (pure overhead) |

Reading this table. The three planners do not dominate each other on cost, latency, or correctness — they trade axes. ReAct pays in LLM calls to avoid committing; Plan-and-Execute pays in plan brittleness to save LLM calls; Reflexion pays a ~2x multiplier to buy correctness it cannot otherwise verify.
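The "LLM calls" row can be turned into three one-line formulas, which makes the trade-off easy to check for a given goal. The function names below are hypothetical and exist only for this sketch; `N`, `R`, and `C` follow the table's notation (steps, replans, outer iterations).

```python
def react_calls(n_steps: int) -> int:
    """ReAct: one LLM call per step (N)."""
    return n_steps


def plan_and_execute_calls(replans: int = 0) -> int:
    """Plan-and-Execute: one plan call plus one call per replan (1 + R).

    Dispatches are kernel calls, not LLM calls, so they are not counted.
    """
    return 1 + replans


def reflexion_calls(outer_iters: int, n_inner: int) -> int:
    """Reflexion: each outer iteration is an inner run plus one critic call."""
    return outer_iters * (n_inner + 1)
```

With three inner steps, a single accepted Reflexion iteration costs `reflexion_calls(1, 3)` calls against `react_calls(3)` for plain ReAct, so the critic adds one call per outer iteration on top of each re-run.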

Worked example: same goal, three planners

Goal. "Find requirements with no test coverage and propose new test cases for them." Assume two capabilities in the registry: reqs.list_uncovered and tests.propose (both pure dispatch).

ReAct trajectory

from trails.agent.planners import react
# turn 1: {"action": "reqs.list_uncovered", "action_input": {}}
# turn 2: {"action": "tests.propose", "action_input": {"req_ids": [...from turn 1...]}}
# turn 3: {"action": "finish", "final_answer": "3 proposals for REQ-7, REQ-12, REQ-18"}
result = react.run(goal, llm=client, session=session)

Three LLM calls; the model sees each observation before picking the next action. Natural fit when turn 2's input literally comes from turn 1's output.

Plan-and-Execute trajectory

from trails.agent.planners import plan_and_execute
# plan (one LLM call): [reqs.list_uncovered, tests.propose, finish]
# execute: three dispatches, no extra LLM
result = plan_and_execute.run(goal, llm=client, session=session)

One planning call plus the dispatches. Works because the plan shape is obvious before seeing data. If reqs.list_uncovered returned a surprising payload (e.g., 200 uncovered requirements → tests.propose needs batching), the executor triggers _request_plan() again — session.metadata["plans"] grows by one.
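The replan behaviour described above can be sketched as a loop. This is a simplified model under stated assumptions, not the library's executor: `execute_with_replans`, `request_plan`, and `dispatch` are hypothetical names, `TrailsError` is redefined locally so the block runs on its own, and it assumes a fresh plan is requested with the observations gathered so far.

```python
class TrailsError(Exception):
    """Local stand-in for the real TrailsError so this sketch is self-contained."""


def execute_with_replans(request_plan, dispatch, metadata, max_replans=3):
    """Run a plan step by step; ask for a new plan when a dispatch fails."""
    plans = metadata.setdefault("plans", [])  # mirrors session.metadata["plans"]
    observations = []
    plan = request_plan(observations)  # the single upfront planning call
    plans.append(plan)
    replans = 0
    while True:
        try:
            for action in plan:
                observations.append(dispatch(action))  # no LLM call here
            return observations
        except TrailsError as exc:
            if replans >= max_replans:
                raise  # replan budget exhausted: surface the error
            replans += 1
            observations.append(f"error: {exc}")
            plan = request_plan(observations)  # one extra planning call
            plans.append(plan)
```

Each replan appends one entry to `metadata["plans"]`, which is what makes the trajectory easy to replay afterwards.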

Reflexion trajectory

from trails.agent.planners import reflexion
# outer iter 1: inner ReAct runs (3 LLM calls) → candidate answer
#               critic call → {"verdict": "retry", "critique": "no coverage % cited"}
# outer iter 2: inner ReAct re-runs with critique in history → stronger answer
#               critic call → {"verdict": "accept"}
result = reflexion.run(goal, llm=client, session=session)

Roughly 2x ReAct's calls when the first attempt needs a rewrite. The payoff: the final answer survives a second look. Skip Reflexion when "3 proposals for REQ-7/12/18" is self-evidently on-goal — the critic turn is wasted.
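The outer loop in that trajectory can be sketched as follows. This is a minimal model, not the code in reflexion.py: `reflexion_loop`, `run_inner`, and `critic` are hypothetical names, and the `"accepted"` stop label is a placeholder chosen for this sketch. The `verdict` and `critique` keys and the `"max_steps"` stop reason are taken from this guide.

```python
def reflexion_loop(run_inner, critic, metadata, max_outer_iterations=3):
    """Re-run the inner planner until the critic accepts or the budget runs out."""
    critiques = metadata.setdefault("critiques", [])  # mirrors session.metadata["critiques"]
    history = []  # critiques fed back into the next inner run
    answer = None
    for _ in range(max_outer_iterations):
        answer = run_inner(history)   # one full inner ReAct run
        review = critic(answer)       # one critic call
        critiques.append(review)
        if review["verdict"] == "accept":
            return answer, "accepted"  # placeholder label, see lead-in
        history.append(review["critique"])
    # The critic never accepted within the budget.
    return answer, "max_steps"
```

The cost multiplier is visible in the loop body: every rejected answer pays for another complete inner run, while the critic itself adds only one call per iteration.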

| Planner | LLM calls (happy path) | Steps on `PlanResult` | Quality posture |
| --- | --- | --- | --- |
| ReAct | 3 | 3 `PlanStep`s | Correct by construction when tool errors don't fire |
| Plan-and-Execute | 1 | 3 `PlanStep`s | Correct if the upfront plan is correct |
| Reflexion | 4 (1 iter) or ~8 (2 iters) | 3 or 6 `PlanStep`s | Correct after passing a critic |

Mock trajectories above mirror the scripted turns in python/tests/test_react_planner.py, test_plan_execute_planner.py, and test_reflexion_planner.py — the _fence() helpers and _scripted_client / _dual_client fixtures there are the runnable versions.

Composition

Reflexion already wraps ReAct; that is its whole job (reflexion.run calls react.run once per outer iteration; see python/src/trails/agent/planners/reflexion.py line 349). The useful additional composition is Plan-and-Execute wrapping Reflexion: run Reflexion over a single planned sub-task, and treat its critique as a signal to replan the whole top-level plan.

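from trails.agent.planners import reflexion
# Imports for `capability` and `Session` are omitted; their module paths are not shown in this guide.

@capability(id="meta.reflex_subtask")
def reflex_subtask(ctx, goal: str) -> dict:
    # Run Reflexion on one sub-task in a fresh session for the calling principal.
    r = reflexion.run(goal, llm=ctx.llm, session=Session(principal=ctx.principal))
    return {"answer": r.answer, "stopped": r.stopped}
# P&E plan step: {"action": "meta.reflex_subtask", "action_input": {"goal": "..."}}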

P&E's replan trigger (TrailsError from dispatch) does not fire on a merely-poor-but-non-error answer; a thin wrapper that raises TrailsError when the Reflexion result carries stopped="max_steps" escalates "critic never accepted" into "replan the outer plan."
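A minimal sketch of that thin wrapper is shown below. It is written against stand-ins so that it runs on its own: `TrailsError` is redefined locally, `FakeResult` is a hypothetical substitute for `PlanResult` carrying only the two fields used here, and `escalate_unaccepted` is an illustrative name. The `"finish"` stop value in the usage lines is a placeholder; only `"max_steps"` comes from this guide.

```python
from dataclasses import dataclass


class TrailsError(Exception):
    """Local stand-in for the real TrailsError so this sketch is self-contained."""


@dataclass
class FakeResult:
    """Hypothetical substitute for PlanResult with only the fields used below."""
    answer: str
    stopped: str


def escalate_unaccepted(result):
    """Turn "critic never accepted" into an error the outer plan can react to."""
    if result.stopped == "max_steps":
        # Raising here is what lets Plan-and-Execute's replan trigger fire.
        raise TrailsError("reflexion critic never accepted; replan requested")
    return {"answer": result.answer, "stopped": result.stopped}


# Usage with placeholder values:
ok = escalate_unaccepted(FakeResult(answer="3 proposals", stopped="finish"))
```

Inside the capability above, the same check would sit between the `reflexion.run` call and the `return`.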

Hybrid: when none fits

Planners are not free. Each one adds a format-contract prompt, an outer loop, and a retry rung. For tasks that boil down to one templated LLM call plus one kernel call, a hand-written loop built directly on LLMClient.complete is often simpler, cheaper, and easier to audit:

text = llm.complete([Message(role="user", content=prompt)], ctx=ctx).text
envelope = trails.invoke("notes.tag", {"id": note_id, "tag": text.strip()},
                         principal="did:local:alice")

The rule: reach for a planner when you genuinely don't know the next action in advance. If you do know, the planner is overhead that buys nothing. The LLM guide covers the client-level primitives on their own for exactly this case.

Cost analysis

With the CostScope dedup wiring in commit 3e99980, planner-step LLM tokens are the authoritative billing row; nested capability-internal LLMClient.complete calls inherit the scope and land as dedupe="child" in CostTracker.records() but are excluded from totals. That means cost comparisons are clean:

ReAct. N step calls × step tokens ≈ total. No critic, no plan call. Empirically the cheapest for goals that fit under max_steps.
Plan-and-Execute. 1 plan call (large, holds the full tool description and observation history) + 0 step LLM calls of its own + up to max_replans (default 3) additional plan calls. For long, clean trajectories this beats ReAct because the plan prompt is billed once, not N times. Under frequent replans its cost silently overtakes ReAct's.
Reflexion. C * N_inner inner step calls + C critic calls, where C ≤ max_outer_iterations. Practical budget: 2–3x ReAct on non-trivial goals that actually retry. The critic prompt is small and cacheable, so the multiplier is dominated by re-running the inner ReAct loop.
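The dedup rule stated at the top of this section (child rows are recorded but excluded from totals) can be illustrated with a few lines. The record shape here is an assumption made for the sketch: plain dicts with `tokens` and `dedupe` keys. The real `CostTracker.records()` rows may look different, and `billable_tokens` is a hypothetical helper.

```python
def billable_tokens(records):
    """Sum tokens over rows that are not marked as deduplicated children."""
    return sum(r["tokens"] for r in records if r.get("dedupe") != "child")


# Illustrative rows: two planner steps, plus one nested LLM call made inside
# a capability that inherited the planner's cost scope.
records = [
    {"source": "planner-step", "tokens": 900},
    {"source": "capability-internal", "tokens": 400, "dedupe": "child"},
    {"source": "planner-step", "tokens": 850},
]
total = billable_tokens(records)
```

Only the two planner-step rows contribute to `total`; the child row stays visible for auditing without being billed twice.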

Rule of thumb: Plan-and-Execute wins when the number of replans actually used stays near 0; ReAct wins when plans would be thrown away; Reflexion wins only when answer quality is worth >2x the cheapest option.
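The rule of thumb can be checked with a rough token model built from the three bullets above. Everything here is a sketch: the function names are hypothetical, per-call token sizes are treated as constants, and the numbers in the usage lines are made up for illustration rather than measured.

```python
def react_tokens(n_steps: int, step_tokens: int) -> int:
    """N step calls, each billed at roughly step_tokens."""
    return n_steps * step_tokens


def plan_and_execute_tokens(replans: int, plan_tokens: int) -> int:
    """One plan call plus one more per replan; dispatches cost no LLM tokens."""
    return (1 + replans) * plan_tokens


def reflexion_tokens(outer_iters: int, n_inner: int,
                     step_tokens: int, critic_tokens: int) -> int:
    """Each outer iteration re-runs the inner loop and adds one critic call."""
    return outer_iters * (n_inner * step_tokens + critic_tokens)


# Illustrative numbers only: 6 steps at 1000 tokens, a 2500-token plan prompt.
react = react_tokens(6, 1000)
pe_clean = plan_and_execute_tokens(0, 2500)
pe_two_replans = plan_and_execute_tokens(2, 2500)
```

Under these assumed sizes the clean plan is well below ReAct, while two replans already push Plan-and-Execute above it, which is the "silently overtakes" effect described above.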

Reference: `run(...)` signatures side-by-side

react.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None,
    max_tokens_per_step=1024, temperature=0.0)

plan_and_execute.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None,
    max_replans=3, max_tokens_per_call=1024, temperature=0.0)

reflexion.run(goal, *, llm, session, ctx=None, max_steps=10, tools=None,
    max_outer_iterations=3, max_tokens_per_call=1024, temperature=0.0)

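The only strategy-specific kwargs are max_replans (P&E) and max_outer_iterations (Reflexion). Note that, as listed above, ReAct names its token cap max_tokens_per_step where the other two use max_tokens_per_call. Everything else is uniform by design, so A/B harnesses can swap strategies without touching call sites.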
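A minimal A/B harness built on that uniformity might look like the sketch below. To keep it runnable without the library, the three planners are replaced by stubs that share the common signature; `_stub_run`, `PLANNERS`, and `run_ab` are hypothetical names, and in real use the dictionary values would be `react.run`, `plan_and_execute.run`, and `reflexion.run`.

```python
from types import SimpleNamespace


def _stub_run(name):
    """Build a stand-in planner with the shared run(...) signature."""
    def run(goal, *, llm, session, ctx=None, max_steps=10, tools=None, **kwargs):
        # A real planner would call the LLM; the stub just echoes the goal.
        return SimpleNamespace(answer=f"{name}: {goal}", stopped="finish")
    return run


# Assumption: swap these stubs for the real run functions when trails is installed.
PLANNERS = {
    "react": _stub_run("react"),
    "plan_and_execute": _stub_run("plan_and_execute"),
    "reflexion": _stub_run("reflexion"),
}


def run_ab(goal, *, llm, session, strategies=None, **shared_kwargs):
    """Run the same goal through every strategy with identical shared kwargs."""
    strategies = strategies or PLANNERS
    return {
        name: run(goal, llm=llm, session=session, **shared_kwargs)
        for name, run in strategies.items()
    }


results = run_ab("tag notes", llm=None, session=None, max_steps=5)
```

Pass only the uniform kwargs (`max_steps`, `tools`, `temperature`, `ctx`) through `shared_kwargs`; strategy-specific ones are better bound per planner, for example with `functools.partial`.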