Skip to content

LLM Client & Session

trails.llm is the framework-owned LLM client. One call surface across four providers (Anthropic, OpenAI, Ollama, mock), with cost envelopes and PROV-O activities wired in automatically when a Context is present. trails.agent.context adds Session — a token-windowed conversation store for multi-turn work. Both primitives land in M9 Phase 1, specified by ADR-0018. The client is deliberately thin: raw complete(...) only. No chains, no DAGs, no planner — planners arrive in Phase 2.

Quickstart

from trails import capability
from trails.llm import LLMClient, Message

# Any provider — same API, one-line swap:
client = LLMClient.ollama(model="qwen3:8b")             # local, free
# client = LLMClient.anthropic(model="claude-sonnet-4-6")  # cloud, paid
# client = LLMClient.openai(model="gpt-4.1-mini")           # cloud, paid
# client = LLMClient.mock(response="A short summary.")     # tests, deterministic

@capability(id="note.summarize")
def summarize(ctx, text: str) -> dict:
    resp = client.complete(
        messages=[Message(role="user", content=f"Summarize:\n{text}")],
        max_tokens=500,
        temperature=0.0,
        ctx=ctx,
    )
    return {"summary": resp.text, "usd": resp.cost_usd}

Passing ctx=ctx is the whole integration: the call is tracked against trails.cost.CostTracker and emits a trails:LLMCompletion activity into the provenance graph. Tests can swap in LLMClient.mock(...) without touching the handler. See the Capabilities guide for what ctx carries.

Providers

All three providers share the same .complete() signature. Swap with a one-line change — no code modifications needed in your capabilities.

Ollama — LLMClient.ollama(...) — local, free

LLMClient.ollama(
    model="qwen3:8b",
    base_url="http://localhost:11434",
    retry=None,
    timeout=30.0,
)

No extra dependencies. The client speaks the Ollama REST API directly via stdlib http.client. Run any model locally — Qwen, Llama, Mistral, Phi, Gemma, or anything Ollama supports. A connection failure surfaces as TrailsError with the hint Is 'ollama serve' running?. HTTP 5xx responses are retried as transient; 4xx responses raise immediately.

ollama pull qwen3:8b    # ~5 GB, runs on most laptops
ollama serve            # start the local server

OpenAI — LLMClient.openai(...) — cloud

LLMClient.openai(
    model="gpt-4.1-mini",
    api_key=None,           # falls back to OPENAI_API_KEY env
    base_url=None,
    cache=False,             # accepted but ignored — OpenAI has no prompt caching
    retry=None,              # RetryPolicy override
    timeout=30.0,
)

Wraps the optional openai SDK. Not installed by default — declare the extra:

pip install 'trails[openai]'

If the SDK is missing, LLMClient.openai() raises TrailsError with the install hint — same pattern as Anthropic.

o-series models (o3, o4-mini). These reasoning models support reasoning_effort ("low", "medium", "high") mapped from the Trails effort parameter. For o-series models, temperature is not sent (they don't support it) and max_completion_tokens is used instead of max_tokens.

client = LLMClient.openai(model="o4-mini")
resp = client.complete(
    messages=[Message(role="user", content="Solve this step by step.")],
    max_tokens=8192,
    effort="high",   # maps to OpenAI reasoning_effort
    ctx=ctx,
)

Anthropic-only parameters (thinking, task_budget) are silently ignored when using the OpenAI provider.

Anthropic — LLMClient.anthropic(...) — cloud

LLMClient.anthropic(
    model="claude-sonnet-4-6",
    api_key=None,           # falls back to ANTHROPIC_API_KEY env
    base_url=None,
    cache=False,             # enable prompt-cache pass-through
    retry=None,              # RetryPolicy override
    timeout=30.0,
)

Wraps the optional anthropic SDK. Not installed by default — declare the extra:

pip install 'trails[llm]'

If the SDK is missing, LLMClient.anthropic() raises TrailsError with the install hint — no opaque ImportError leaks from module import.

Mock — LLMClient.mock(...) — tests

LLMClient.mock(
    response="canned reply",         # str or Callable[[list[Message]], str]
    usage=None,                      # optional LLMUsage override
    cost_usd=0.0,
    stop_reason="end_turn",
    fail_first=0,                    # first N calls raise transient error
)

Use in unit tests. A callable response receives the assembled message list, letting tests assert prompt composition.

Shared call shape

resp = client.complete(
    messages=[Message(role="user", content="...")],
    max_tokens=1024,
    temperature=0.0,
    stop=None,           # optional list of stop sequences
    ctx=ctx,             # optional; enables cost + prov integration
)

messages also accepts plain dicts ({"role": ..., "content": ...}); they are normalized to Message. Empty lists and unknown role strings raise TrailsError.

LLMResponse fields: text, usage (LLMUsage with prompt_tokens / completion_tokens / total_tokens), model, stop_reason, cost_usd, cache_hit, and an opaque raw.

Messages & Session

Message(role, content, cache=False) — frozen dataclass, role is one of system, user, assistant. cache=True is a per-message flag for Anthropic prompt caching; see below.

For chat, ReAct, and multi-turn work, use Session from trails.agent.context:

from trails.agent.context import Session

sess = Session(principal="did:local:alice", max_tokens=32_000, pin_head=1)
sess.append("system", "You are a helpful analyst.")
sess.append("user", "Summarize Q1 filings for SKU-7.")
resp = client.complete(
    messages=[Message(role=m.role, content=m.content) for m in sess.messages()],
    ctx=ctx,
)
sess.append("assistant", resp.text)

Session.history is a TokenWindow: the first pin_head messages are pinned (default 1 — meant for the system prompt), the rest is a FIFO sliding window trimmed until the running total fits inside max_tokens. Eviction returns the dropped messages so callers can log them.

Token accounting uses a heuristic — ceil(words / 0.75), the inverse of "1 token ≈ 0.75 English words". No tokenizer dependency in Phase 1; a pluggable token_counter lands in Phase 2 (ADR-0018 Open Question #2).

When to use which. One-shot transformations (summaries, classification, extraction inside a single capability) → ad-hoc messages=[...]. Chat, ReAct loops, multi-turn reasoning, anything that must remember prior turns → Session. Session is in-memory only in Phase 1; persistence to the KG is Phase 4.

Cost & Provenance integration

Pass ctx to complete() and two things happen automatically:

  1. Cost envelope. Spend is tracked against the module-level CostTracker under capability_id="llm:<model>", tagged with ctx.principal, with usd, total tokens, and latency recorded. This closes the LLM side of ADR-0012 — the dominant cost source is finally inside the primitive.
  2. PROV-O activity. A trails:LLMCompletion node is inserted into the <https://trails.dev/prov/> graph via SPARQL UPDATE, linked to the principal and tagged with model, token counts, cost, and cacheHit. This extends ADR-0009 from the capability boundary to the LLM boundary.

Omit ctx from tests and scripts — neither hook fires. Cost attribution uses a bundled price table for public Anthropic models (USD per 1M tokens, in / out); unknown models fall back to 0.0. Ollama and mock calls always report cost_usd=0.0.

PROV emission is best-effort. If the store denies the write (the kernel owns the prov: graph per ADR-0009 Update ©; the M0 SPARQL escape hatch is still reachable), the client logs a warning and returns the response anyway. A prov failure never kills an LLM call.

Retry & error handling

RetryPolicy defaults: 3 retries, 0.5 s base, 8 s cap, full jitter. Override per client:

from trails.llm import LLMClient, RetryPolicy
client = LLMClient.anthropic(retry=RetryPolicy(max_retries=5, base_delay=1.0))

Transient errors retry automatically: Anthropic SDK errors whose class name contains RateLimit, APIConnection, APITimeout, InternalServer, or ServiceUnavailable; Ollama HTTP 5xx; mock fail_first slots. On retry exhaustion the client raises TrailsError with the provider name and attempt count.

Non-transient errors raise TrailsError immediately — Ollama HTTP 4xx, missing anthropic SDK, bad message shapes, unreachable base URL. Timeouts default to 30 seconds on every network call. Authorization and x-api-key headers are never logged.

Prompt caching (Anthropic)

Anthropic ephemeral prompt caching requires both flags set: LLMClient.anthropic(cache=True) at client construction and Message(..., cache=True) on the specific message blocks to cache. The client emits cache_control: {"type": "ephemeral"} on those blocks only; everything else passes through unmarked.

client = LLMClient.anthropic(model="claude-sonnet-4-6", cache=True)
messages = [
    Message(role="system", content=LONG_SYSTEM_PROMPT, cache=True),
    Message(role="user", content=user_turn),
]
resp = client.complete(messages=messages, ctx=ctx)
resp.cache_hit  # True if the SDK reported cache_read_input_tokens > 0

Long reused system prompts are the target use case — cache hits save upwards of 90 % of input token cost. Ollama and mock ignore cache silently; resp.cache_hit stays False.

Cache TTL

By default, cached prompts expire after 5 minutes. For agentic loops where side-agents or long reasoning chains exceed that window, use 1-hour TTL (2x write cost, but keeps the cache alive longer):

client = LLMClient.anthropic(model="claude-opus-4-7", cache=True, cache_ttl="1h")

Cache usage tracking

LLMUsage now tracks cache metrics separately:

resp = client.complete(messages=messages, ctx=ctx)
print(f"Cache read:     {resp.usage.cache_read_tokens} tokens")
print(f"Cache creation: {resp.usage.cache_creation_tokens} tokens")
print(f"Cache hit:      {resp.cache_hit}")

Adaptive thinking (Anthropic)

Extended thinking gives Claude step-by-step reasoning before answering. Adaptive thinking is the recommended mode — Claude decides when and how much to think based on query complexity. Required for Opus 4.7 (the only mode it supports).

from trails.llm import LLMClient, Message, ThinkingConfig

client = LLMClient.anthropic(model="claude-opus-4-7")

resp = client.complete(
    messages=[Message(role="user", content="Prove that √2 is irrational.")],
    max_tokens=16_000,
    thinking=ThinkingConfig.adaptive(),
    ctx=ctx,
)

# Access thinking blocks (if any — adaptive may skip for simple queries)
for tb in resp.thinking_blocks:
    print(f"Thinking: {tb.thinking[:100]}...")
print(f"Answer: {resp.text}")

Modes:

Mode Constructor Models
Adaptive ThinkingConfig.adaptive() Opus 4.7 (only), Opus 4.6, Sonnet 4.6
Fixed budget ThinkingConfig.enabled(budget_tokens=N) Sonnet 4.5, Opus 4.5 and earlier
Off ThinkingConfig.disabled() All except Mythos

Display control: ThinkingConfig.adaptive(display="omitted") skips streaming thinking text (faster TTFT for pipelines that don't surface reasoning to users). You still pay for full thinking tokens.

With effort: Combine thinking=ThinkingConfig.adaptive() with effort="medium" to reduce thinking on simple queries. Effort tunes depth; thinking enables reasoning; task budgets cap total work.

In planner loops: Thinking blocks from tool-use responses must be passed back to the API for reasoning continuity. With adaptive thinking, interleaved thinking (thinking between tool calls) is automatic.

Task budgets (Anthropic, Opus 4.7+)

Task budgets let you tell Claude how many tokens it has for a full agentic loop — including thinking, tool calls, tool results, and output. The model sees a running countdown and self-regulates to finish gracefully as the budget is consumed.

from trails.llm import LLMClient, Message, TaskBudget

client = LLMClient.anthropic(model="claude-opus-4-7")

resp = client.complete(
    messages=[Message(role="user", content="Audit this codebase.")],
    max_tokens=128_000,
    task_budget=TaskBudget(total=64_000),
    effort="high",
    ctx=ctx,
)
# resp.budget_remaining carries the server-reported remainder (if any).

Key points:

  • TaskBudget(total=N) — advisory budget in tokens. Minimum 20,000.
  • remaining — carry the budget across context compaction: TaskBudget(total=128_000, remaining=128_000 - spent_so_far). Omit when resending full uncompacted history (server tracks it).
  • effort — per-step reasoning depth ("low", "medium", "high"). Effort tunes depth; task budgets tune breadth. Complementary.
  • Advisory, not enforced. Claude may slightly exceed the budget to finish a mid-action step. max_tokens remains the hard cap.
  • Non-Anthropic providers ignore both task_budget and effort silently — the call still works, the parameters are just not sent.
  • Integrates with Trails' cost envelopes: the framework still tracks actual spend via CostTracker regardless of the budget hint.

See: Anthropic task budgets docs

Structured output

Use complete_structured() to get JSON responses that conform to a @shape or @node_type schema. The framework resolves the schema, injects the right provider hints, and validates the response.

With @shape

from trails.llm import LLMClient, Message
from trails.shapes import shape, predicate

@shape(iri="https://myapp.example/ns/Analysis")
class Analysis:
    summary: str = predicate("ex:summary", min_length=10)
    confidence: float = predicate("ex:confidence", min_value=0.0, max_value=1.0)
    tags: list[str] = predicate("ex:tags", min=1)

client = LLMClient.ollama(model="qwen3:8b")
resp = client.complete_structured(
    [Message(role="user", content="Analyze this document: ...")],
    shape_or_schema="https://myapp.example/ns/Analysis",
    ctx=ctx,
)
# resp.text is valid JSON conforming to the Analysis schema.
import json
result = json.loads(resp.text)
print(result["summary"], result["confidence"])

With @node_type

from trails import node_type

@node_type("Finding", fields={"title": str, "severity": int, "description": str})
class Finding:
    pass

resp = client.complete_structured(
    [Message(role="user", content="Find issues in: ...")],
    shape_or_schema=Finding,
    ctx=ctx,
)

With a raw JSON Schema dict

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["answer", "score"],
}
resp = client.complete_structured(
    [Message(role="user", content="Rate this: ...")],
    shape_or_schema=schema,
)

How it works

  1. shape_or_schema is resolved to a JSON Schema dict (via shape_to_json_schema() or node_type_to_json_schema()).
  2. The schema is passed as response_format={"type": "json_schema", "schema": ...} to complete().
  3. Provider-specific handling:
  4. Anthropic: a system instruction requesting JSON + the schema.
  5. OpenAI: response_format passed to the API directly.
  6. Ollama: format: "json" or the schema dict on the request.
  7. The response text is validated against the schema before returning. Invalid JSON or schema violations raise TrailsError.

You can also use response_format directly on complete() for simpler cases (e.g., response_format="json" for unstructured JSON output).

Batch API

LLMClient.batch() runs multiple completions in one call, returning a BatchResult per request. Each request carries a custom_id for correlation. All providers use sequential fallback in M0; native async batch is a follow-up.

from trails.llm import LLMClient, BatchRequest, Message

client = LLMClient.ollama(model="qwen3:8b")

results = client.batch([
    BatchRequest(
        custom_id="item-1",
        messages=[Message(role="user", content="Summarize: ...")],
        max_tokens=200,
    ),
    BatchRequest(
        custom_id="item-2",
        messages=[Message(role="user", content="Classify: ...")],
        max_tokens=100,
    ),
], ctx=ctx)

for r in results:
    print(r.custom_id, r.response.text if r.response else r.error)

The enrichment pipeline integrates with batch via run_enrichments(batch=True, batch_size=50) and the CLI trails enrich run --batch.

Anti-patterns

  • Hand-rolling HTTP to Anthropic or Ollama. You lose cost envelope, PROV-O, retry, redaction, and timeout defaults. Always go through LLMClient.
  • Forgetting ctx inside a capability. No cost telemetry, no provenance, no principal on the activity. The call still works — silently — which is the worst failure mode.
  • API keys in test fixtures. Use LLMClient.mock(...); its fail_first slot exercises the retry path without network or secrets.
  • Raising cache expectations on Ollama or mock. cache=True is Anthropic-only. Don't assert cache_hit is True for local models.

Reference

trails.llm

Symbol One-line summary
LLMClient.anthropic(*, model, api_key=None, base_url=None, cache=False, retry=None, timeout=30.0) Anthropic-backed client; requires trails[llm] extra.
LLMClient.openai(*, model="gpt-4.1-mini", api_key=None, base_url=None, cache=False, retry=None, timeout=30.0) OpenAI-backed client; requires trails[openai] extra. effort maps to reasoning_effort for o-series models.
LLMClient.ollama(*, model, base_url="http://localhost:11434", retry=None, timeout=30.0) Local Ollama via stdlib HTTP; no extra deps.
LLMClient.mock(*, model="mock:canned", response, usage=None, cost_usd=0.0, stop_reason="end_turn", fail_first=0, retry=None) Deterministic test client.
LLMClient.complete(messages, *, max_tokens=1024, temperature=0.0, stop=None, ctx=None, task_budget=None, effort=None, thinking=None, response_format=None) Run one completion; wires cost + PROV when ctx is set. response_format accepts "json" or {"type": "json_schema", "schema": dict}.
LLMClient.complete_structured(messages, *, shape_or_schema, max_tokens=1024, temperature=0.0, ctx=None) Complete with structured output constrained to a @shape IRI, @node_type class, or raw JSON Schema dict. Validates response.
shape_to_json_schema(shape_id) Convert a registered @shape to a JSON Schema dict. In trails.shapes.
node_type_to_json_schema(cls) Convert a @node_type class to a JSON Schema dict. In trails.shapes.
TaskBudget(total, remaining=None) Advisory token budget for a full agentic loop (Anthropic Opus 4.7+). Minimum total is 20,000.
ThinkingConfig.adaptive(display=None) Adaptive thinking — Claude decides when/how much to reason. Required for Opus 4.7.
ThinkingConfig.enabled(budget_tokens, display=None) Fixed-budget thinking (deprecated on 4.6+, rejected on 4.7).
ThinkingConfig.disabled() No thinking.
ThinkingBlock(thinking, signature=None) One thinking block from a response. signature is opaque — pass it back for multi-turn continuity.
LLMClient.provider / LLMClient.model Read-only identity.
Message(role, content, cache=False) Immutable chat message; role in {system, user, assistant}.
LLMUsage(prompt_tokens, completion_tokens, total_tokens) Token usage breakdown.
LLMResponse(text, usage, model, stop_reason, cost_usd, cache_hit, raw) Normalized provider response.
RetryPolicy(max_retries=3, base_delay=0.5, max_delay=8.0, jitter=True) Exponential-backoff config with full-jitter option.

trails.agent.context

Symbol One-line summary
Session(*, principal, max_tokens=32_000, pin_head=1, auth=None, card=None) Per-run state: history, invocations, auth, WoT card.
Session.append(role, content, *, tokens=None, metadata=None) Append a message; returns evicted messages.
Session.messages() Snapshot of the current window as SessionMessage list.
Session.record_invocation(envelope) Log a capability-invoke envelope into the session.
Session.id / Session.principal / Session.history / Session.invocations / Session.auth / Session.card Read-accessible session state.
SessionMessage(role, content, tokens=0, metadata={}) One history entry; auto-estimates tokens if unset.
TokenWindow(*, max_tokens=32_000, pin_head=1) FIFO window with pinned head and running-total eviction.

See also: the ORM guide for ctx.kg (what capabilities do with the side ctx carries), the Capabilities guide for the dispatch surface LLM calls live inside, and ADR-0018 §Phased delivery for what Phase 2+ will add on top.