LLM Client & Session¶
trails.llm is the framework-owned LLM client. One call surface across
four providers (Anthropic, OpenAI, Ollama, mock), with cost envelopes and
PROV-O activities wired in automatically when a Context is present.
trails.agent.context adds Session — a token-windowed conversation
store for multi-turn work. Both primitives land in M9 Phase 1, specified
by ADR-0018. The client is
deliberately thin: raw complete(...) only. No chains, no DAGs, no
planner — planners arrive in Phase 2.
Quickstart¶
from trails import capability
from trails.llm import LLMClient, Message
# Any provider — same API, one-line swap:
client = LLMClient.ollama(model="qwen3:8b") # local, free
# client = LLMClient.anthropic(model="claude-sonnet-4-6") # cloud, paid
# client = LLMClient.openai(model="gpt-4.1-mini") # cloud, paid
# client = LLMClient.mock(response="A short summary.") # tests, deterministic
@capability(id="note.summarize")
def summarize(ctx, text: str) -> dict:
resp = client.complete(
messages=[Message(role="user", content=f"Summarize:\n{text}")],
max_tokens=500,
temperature=0.0,
ctx=ctx,
)
return {"summary": resp.text, "usd": resp.cost_usd}
Passing ctx=ctx is the whole integration: the call is tracked against
trails.cost.CostTracker and emits a trails:LLMCompletion activity
into the provenance graph. Tests can swap in LLMClient.mock(...) without
touching the handler. See the Capabilities guide for
what ctx carries.
Providers¶
All three providers share the same .complete() signature. Swap with a
one-line change — no code modifications needed in your capabilities.
Ollama — LLMClient.ollama(...) — local, free¶
No extra dependencies. The client speaks the Ollama REST API directly
via stdlib http.client. Run any model locally — Qwen, Llama, Mistral,
Phi, Gemma, or anything Ollama supports. A connection failure surfaces as
TrailsError with the hint Is 'ollama serve' running?. HTTP 5xx
responses are retried as transient; 4xx responses raise immediately.
OpenAI — LLMClient.openai(...) — cloud¶
LLMClient.openai(
model="gpt-4.1-mini",
api_key=None, # falls back to OPENAI_API_KEY env
base_url=None,
cache=False, # accepted but ignored — OpenAI has no prompt caching
retry=None, # RetryPolicy override
timeout=30.0,
)
Wraps the optional openai SDK.
Not installed by default — declare the extra:
If the SDK is missing, LLMClient.openai() raises TrailsError with
the install hint — same pattern as Anthropic.
o-series models (o3, o4-mini). These reasoning models support
reasoning_effort ("low", "medium", "high") mapped from the Trails
effort parameter. For o-series models, temperature is not sent
(they don't support it) and max_completion_tokens is used instead
of max_tokens.
client = LLMClient.openai(model="o4-mini")
resp = client.complete(
messages=[Message(role="user", content="Solve this step by step.")],
max_tokens=8192,
effort="high", # maps to OpenAI reasoning_effort
ctx=ctx,
)
Anthropic-only parameters (thinking, task_budget) are silently
ignored when using the OpenAI provider.
Anthropic — LLMClient.anthropic(...) — cloud¶
LLMClient.anthropic(
model="claude-sonnet-4-6",
api_key=None, # falls back to ANTHROPIC_API_KEY env
base_url=None,
cache=False, # enable prompt-cache pass-through
retry=None, # RetryPolicy override
timeout=30.0,
)
Wraps the optional anthropic
SDK. Not installed by default — declare the extra:
If the SDK is missing, LLMClient.anthropic() raises TrailsError
with the install hint — no opaque ImportError leaks from module import.
Mock — LLMClient.mock(...) — tests¶
LLMClient.mock(
response="canned reply", # str or Callable[[list[Message]], str]
usage=None, # optional LLMUsage override
cost_usd=0.0,
stop_reason="end_turn",
fail_first=0, # first N calls raise transient error
)
Use in unit tests. A callable response receives the assembled message
list, letting tests assert prompt composition.
Shared call shape¶
resp = client.complete(
messages=[Message(role="user", content="...")],
max_tokens=1024,
temperature=0.0,
stop=None, # optional list of stop sequences
ctx=ctx, # optional; enables cost + prov integration
)
messages also accepts plain dicts ({"role": ..., "content": ...});
they are normalized to Message. Empty lists and unknown role strings
raise TrailsError.
LLMResponse fields: text, usage (LLMUsage with
prompt_tokens / completion_tokens / total_tokens), model,
stop_reason, cost_usd, cache_hit, and an opaque raw.
Messages & Session¶
Message(role, content, cache=False) — frozen dataclass, role is one of
system, user, assistant. cache=True is a per-message flag for
Anthropic prompt caching; see below.
For chat, ReAct, and multi-turn work, use Session from
trails.agent.context:
from trails.agent.context import Session
sess = Session(principal="did:local:alice", max_tokens=32_000, pin_head=1)
sess.append("system", "You are a helpful analyst.")
sess.append("user", "Summarize Q1 filings for SKU-7.")
resp = client.complete(
messages=[Message(role=m.role, content=m.content) for m in sess.messages()],
ctx=ctx,
)
sess.append("assistant", resp.text)
Session.history is a TokenWindow: the first pin_head messages are
pinned (default 1 — meant for the system prompt), the rest is a FIFO
sliding window trimmed until the running total fits inside
max_tokens. Eviction returns the dropped messages so callers can log
them.
Token accounting uses a heuristic — ceil(words / 0.75), the inverse
of "1 token ≈ 0.75 English words". No tokenizer dependency in Phase 1;
a pluggable token_counter lands in Phase 2 (ADR-0018 Open Question #2).
When to use which. One-shot transformations (summaries, classification,
extraction inside a single capability) → ad-hoc messages=[...].
Chat, ReAct loops, multi-turn reasoning, anything that must remember
prior turns → Session. Session is in-memory only in Phase 1;
persistence to the KG is Phase 4.
Cost & Provenance integration¶
Pass ctx to complete() and two things happen automatically:
- Cost envelope. Spend is tracked against the module-level
CostTrackerundercapability_id="llm:<model>", tagged withctx.principal, withusd, total tokens, and latency recorded. This closes the LLM side of ADR-0012 — the dominant cost source is finally inside the primitive. - PROV-O activity. A
trails:LLMCompletionnode is inserted into the<https://trails.dev/prov/>graph via SPARQL UPDATE, linked to the principal and tagged with model, token counts, cost, andcacheHit. This extends ADR-0009 from the capability boundary to the LLM boundary.
Omit ctx from tests and scripts — neither hook fires. Cost
attribution uses a bundled price table for public Anthropic models
(USD per 1M tokens, in / out); unknown models fall back to 0.0.
Ollama and mock calls always report cost_usd=0.0.
PROV emission is best-effort. If the store denies the write (the
kernel owns the prov: graph per ADR-0009 Update ©; the M0 SPARQL
escape hatch is still reachable), the client logs a warning and returns
the response anyway. A prov failure never kills an LLM call.
Retry & error handling¶
RetryPolicy defaults: 3 retries, 0.5 s base, 8 s cap, full jitter.
Override per client:
from trails.llm import LLMClient, RetryPolicy
client = LLMClient.anthropic(retry=RetryPolicy(max_retries=5, base_delay=1.0))
Transient errors retry automatically: Anthropic SDK errors whose class
name contains RateLimit, APIConnection, APITimeout,
InternalServer, or ServiceUnavailable; Ollama HTTP 5xx; mock
fail_first slots. On retry exhaustion the client raises TrailsError
with the provider name and attempt count.
Non-transient errors raise TrailsError immediately — Ollama HTTP 4xx,
missing anthropic SDK, bad message shapes, unreachable base URL.
Timeouts default to 30 seconds on every network call. Authorization
and x-api-key headers are never logged.
Prompt caching (Anthropic)¶
Anthropic ephemeral prompt caching requires both flags set:
LLMClient.anthropic(cache=True) at client construction and
Message(..., cache=True) on the specific message blocks to cache.
The client emits cache_control: {"type": "ephemeral"} on those blocks
only; everything else passes through unmarked.
client = LLMClient.anthropic(model="claude-sonnet-4-6", cache=True)
messages = [
Message(role="system", content=LONG_SYSTEM_PROMPT, cache=True),
Message(role="user", content=user_turn),
]
resp = client.complete(messages=messages, ctx=ctx)
resp.cache_hit # True if the SDK reported cache_read_input_tokens > 0
Long reused system prompts are the target use case — cache hits save
upwards of 90 % of input token cost. Ollama and mock ignore cache
silently; resp.cache_hit stays False.
Cache TTL¶
By default, cached prompts expire after 5 minutes. For agentic loops where side-agents or long reasoning chains exceed that window, use 1-hour TTL (2x write cost, but keeps the cache alive longer):
Cache usage tracking¶
LLMUsage now tracks cache metrics separately:
resp = client.complete(messages=messages, ctx=ctx)
print(f"Cache read: {resp.usage.cache_read_tokens} tokens")
print(f"Cache creation: {resp.usage.cache_creation_tokens} tokens")
print(f"Cache hit: {resp.cache_hit}")
Adaptive thinking (Anthropic)¶
Extended thinking gives Claude step-by-step reasoning before answering. Adaptive thinking is the recommended mode — Claude decides when and how much to think based on query complexity. Required for Opus 4.7 (the only mode it supports).
from trails.llm import LLMClient, Message, ThinkingConfig
client = LLMClient.anthropic(model="claude-opus-4-7")
resp = client.complete(
messages=[Message(role="user", content="Prove that √2 is irrational.")],
max_tokens=16_000,
thinking=ThinkingConfig.adaptive(),
ctx=ctx,
)
# Access thinking blocks (if any — adaptive may skip for simple queries)
for tb in resp.thinking_blocks:
print(f"Thinking: {tb.thinking[:100]}...")
print(f"Answer: {resp.text}")
Modes:
| Mode | Constructor | Models |
|---|---|---|
| Adaptive | ThinkingConfig.adaptive() |
Opus 4.7 (only), Opus 4.6, Sonnet 4.6 |
| Fixed budget | ThinkingConfig.enabled(budget_tokens=N) |
Sonnet 4.5, Opus 4.5 and earlier |
| Off | ThinkingConfig.disabled() |
All except Mythos |
Display control: ThinkingConfig.adaptive(display="omitted") skips
streaming thinking text (faster TTFT for pipelines that don't surface
reasoning to users). You still pay for full thinking tokens.
With effort: Combine thinking=ThinkingConfig.adaptive() with
effort="medium" to reduce thinking on simple queries. Effort tunes
depth; thinking enables reasoning; task budgets cap total work.
In planner loops: Thinking blocks from tool-use responses must be passed back to the API for reasoning continuity. With adaptive thinking, interleaved thinking (thinking between tool calls) is automatic.
Task budgets (Anthropic, Opus 4.7+)¶
Task budgets let you tell Claude how many tokens it has for a full agentic loop — including thinking, tool calls, tool results, and output. The model sees a running countdown and self-regulates to finish gracefully as the budget is consumed.
from trails.llm import LLMClient, Message, TaskBudget
client = LLMClient.anthropic(model="claude-opus-4-7")
resp = client.complete(
messages=[Message(role="user", content="Audit this codebase.")],
max_tokens=128_000,
task_budget=TaskBudget(total=64_000),
effort="high",
ctx=ctx,
)
# resp.budget_remaining carries the server-reported remainder (if any).
Key points:
TaskBudget(total=N)— advisory budget in tokens. Minimum 20,000.remaining— carry the budget across context compaction:TaskBudget(total=128_000, remaining=128_000 - spent_so_far). Omit when resending full uncompacted history (server tracks it).effort— per-step reasoning depth ("low","medium","high"). Effort tunes depth; task budgets tune breadth. Complementary.- Advisory, not enforced. Claude may slightly exceed the budget to
finish a mid-action step.
max_tokensremains the hard cap. - Non-Anthropic providers ignore both
task_budgetandeffortsilently — the call still works, the parameters are just not sent. - Integrates with Trails' cost envelopes: the framework still tracks
actual spend via
CostTrackerregardless of the budget hint.
See: Anthropic task budgets docs
Structured output¶
Use complete_structured() to get JSON responses that conform to a
@shape or @node_type schema. The framework resolves the schema,
injects the right provider hints, and validates the response.
With @shape¶
from trails.llm import LLMClient, Message
from trails.shapes import shape, predicate
@shape(iri="https://myapp.example/ns/Analysis")
class Analysis:
summary: str = predicate("ex:summary", min_length=10)
confidence: float = predicate("ex:confidence", min_value=0.0, max_value=1.0)
tags: list[str] = predicate("ex:tags", min=1)
client = LLMClient.ollama(model="qwen3:8b")
resp = client.complete_structured(
[Message(role="user", content="Analyze this document: ...")],
shape_or_schema="https://myapp.example/ns/Analysis",
ctx=ctx,
)
# resp.text is valid JSON conforming to the Analysis schema.
import json
result = json.loads(resp.text)
print(result["summary"], result["confidence"])
With @node_type¶
from trails import node_type
@node_type("Finding", fields={"title": str, "severity": int, "description": str})
class Finding:
pass
resp = client.complete_structured(
[Message(role="user", content="Find issues in: ...")],
shape_or_schema=Finding,
ctx=ctx,
)
With a raw JSON Schema dict¶
schema = {
"type": "object",
"properties": {
"answer": {"type": "string"},
"score": {"type": "integer", "minimum": 1, "maximum": 5},
},
"required": ["answer", "score"],
}
resp = client.complete_structured(
[Message(role="user", content="Rate this: ...")],
shape_or_schema=schema,
)
How it works¶
shape_or_schemais resolved to a JSON Schema dict (viashape_to_json_schema()ornode_type_to_json_schema()).- The schema is passed as
response_format={"type": "json_schema", "schema": ...}tocomplete(). - Provider-specific handling:
- Anthropic: a system instruction requesting JSON + the schema.
- OpenAI:
response_formatpassed to the API directly. - Ollama:
format: "json"or the schema dict on the request. - The response text is validated against the schema before returning.
Invalid JSON or schema violations raise
TrailsError.
You can also use response_format directly on complete() for simpler
cases (e.g., response_format="json" for unstructured JSON output).
Batch API¶
LLMClient.batch() runs multiple completions in one call, returning a
BatchResult per request. Each request carries a custom_id for
correlation. All providers use sequential fallback in M0; native async
batch is a follow-up.
from trails.llm import LLMClient, BatchRequest, Message
client = LLMClient.ollama(model="qwen3:8b")
results = client.batch([
BatchRequest(
custom_id="item-1",
messages=[Message(role="user", content="Summarize: ...")],
max_tokens=200,
),
BatchRequest(
custom_id="item-2",
messages=[Message(role="user", content="Classify: ...")],
max_tokens=100,
),
], ctx=ctx)
for r in results:
print(r.custom_id, r.response.text if r.response else r.error)
The enrichment pipeline integrates with batch via
run_enrichments(batch=True, batch_size=50) and the CLI
trails enrich run --batch.
Anti-patterns¶
- Hand-rolling HTTP to Anthropic or Ollama. You lose cost envelope,
PROV-O, retry, redaction, and timeout defaults. Always go through
LLMClient. - Forgetting
ctxinside a capability. No cost telemetry, no provenance, no principal on the activity. The call still works — silently — which is the worst failure mode. - API keys in test fixtures. Use
LLMClient.mock(...); itsfail_firstslot exercises the retry path without network or secrets. - Raising cache expectations on Ollama or mock.
cache=Trueis Anthropic-only. Don't assertcache_hitisTruefor local models.
Reference¶
trails.llm¶
| Symbol | One-line summary |
|---|---|
LLMClient.anthropic(*, model, api_key=None, base_url=None, cache=False, retry=None, timeout=30.0) |
Anthropic-backed client; requires trails[llm] extra. |
LLMClient.openai(*, model="gpt-4.1-mini", api_key=None, base_url=None, cache=False, retry=None, timeout=30.0) |
OpenAI-backed client; requires trails[openai] extra. effort maps to reasoning_effort for o-series models. |
LLMClient.ollama(*, model, base_url="http://localhost:11434", retry=None, timeout=30.0) |
Local Ollama via stdlib HTTP; no extra deps. |
LLMClient.mock(*, model="mock:canned", response, usage=None, cost_usd=0.0, stop_reason="end_turn", fail_first=0, retry=None) |
Deterministic test client. |
LLMClient.complete(messages, *, max_tokens=1024, temperature=0.0, stop=None, ctx=None, task_budget=None, effort=None, thinking=None, response_format=None) |
Run one completion; wires cost + PROV when ctx is set. response_format accepts "json" or {"type": "json_schema", "schema": dict}. |
LLMClient.complete_structured(messages, *, shape_or_schema, max_tokens=1024, temperature=0.0, ctx=None) |
Complete with structured output constrained to a @shape IRI, @node_type class, or raw JSON Schema dict. Validates response. |
shape_to_json_schema(shape_id) |
Convert a registered @shape to a JSON Schema dict. In trails.shapes. |
node_type_to_json_schema(cls) |
Convert a @node_type class to a JSON Schema dict. In trails.shapes. |
TaskBudget(total, remaining=None) |
Advisory token budget for a full agentic loop (Anthropic Opus 4.7+). Minimum total is 20,000. |
ThinkingConfig.adaptive(display=None) |
Adaptive thinking — Claude decides when/how much to reason. Required for Opus 4.7. |
ThinkingConfig.enabled(budget_tokens, display=None) |
Fixed-budget thinking (deprecated on 4.6+, rejected on 4.7). |
ThinkingConfig.disabled() |
No thinking. |
ThinkingBlock(thinking, signature=None) |
One thinking block from a response. signature is opaque — pass it back for multi-turn continuity. |
LLMClient.provider / LLMClient.model |
Read-only identity. |
Message(role, content, cache=False) |
Immutable chat message; role in {system, user, assistant}. |
LLMUsage(prompt_tokens, completion_tokens, total_tokens) |
Token usage breakdown. |
LLMResponse(text, usage, model, stop_reason, cost_usd, cache_hit, raw) |
Normalized provider response. |
RetryPolicy(max_retries=3, base_delay=0.5, max_delay=8.0, jitter=True) |
Exponential-backoff config with full-jitter option. |
trails.agent.context¶
| Symbol | One-line summary |
|---|---|
Session(*, principal, max_tokens=32_000, pin_head=1, auth=None, card=None) |
Per-run state: history, invocations, auth, WoT card. |
Session.append(role, content, *, tokens=None, metadata=None) |
Append a message; returns evicted messages. |
Session.messages() |
Snapshot of the current window as SessionMessage list. |
Session.record_invocation(envelope) |
Log a capability-invoke envelope into the session. |
Session.id / Session.principal / Session.history / Session.invocations / Session.auth / Session.card |
Read-accessible session state. |
SessionMessage(role, content, tokens=0, metadata={}) |
One history entry; auto-estimates tokens if unset. |
TokenWindow(*, max_tokens=32_000, pin_head=1) |
FIFO window with pinned head and running-total eviction. |
See also: the ORM guide for ctx.kg (what capabilities do
with the side ctx carries), the Capabilities guide
for the dispatch surface LLM calls live inside, and
ADR-0018 §Phased delivery for
what Phase 2+ will add on top.