Enrichment Pipeline¶
Schema transformation reshapes data from one schema to another, but target fields that have no source equivalent remain empty. The enrichment pipeline fills those gaps — deterministically where possible, with LLM assistance where needed, always with cost tracking and provenance. Every feature works without API keys; LLM enrichment is optional. The full design lives in ADR-0026; cost tracking follows ADR-0012.
Overview¶
The enrichment pipeline connects three concepts:
- Placeholder fields on @node_type: typed gaps in the schema that declare "this field exists but is not yet populated."
- Enrichment functions: Python functions registered via @enrichment that know how to fill a specific placeholder.
- Batch execution: trails enrich run iterates unfilled placeholders and calls the registered functions, with cost tracking and PROV-O provenance on every write.
placeholder fields → registered @enrichment functions → trails enrich run → filled data + provenance
Placeholder fields on @node_type¶
A placeholder marks a field as "exists in the schema, not yet filled."
Use the placeholder marker in the field definition:
from trails.orm import node_type, placeholder

@node_type("Article", fields={
    "title": str,
    "body": str,
    "sentiment": (str, placeholder),        # not yet filled
    "quality_score": (float, placeholder),  # not yet filled
    "summary": (str, placeholder),          # not yet filled
})
class Article:
    """An article with enrichment placeholders."""
    pass
Placeholder semantics:
- Nullable by default. Writes that omit placeholder fields succeed without validation errors.
- SHACL export emits sh:minCount 0 for placeholder fields regardless of the type's normal cardinality rules. A str placeholder does not generate sh:minCount 1.
- Type preserved. The placeholder marker is metadata, not a type. The field's actual type (str, float, etc.) is preserved: when a value is written, it is validated against the declared type.
- trails doctor reports fill rates for placeholder fields.
Placeholder fields are typically created by trails onto transform
--enrich (see transformation guide), but you can
also declare them manually in any @node_type.
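The placeholder semantics above (nullable, but type-checked once a value is written) can be illustrated with a small stand-in. This is a sketch only, not the trails implementation: the `_Placeholder` sentinel, `parse_field`, and `validate` are hypothetical names invented here to show how a marker can act as metadata next to the declared type.

```python
from typing import Any

class _Placeholder:
    """Hypothetical sentinel meaning 'declared in the schema, not yet filled'."""
    def __repr__(self) -> str:
        return "placeholder"

placeholder = _Placeholder()

def parse_field(spec: Any) -> tuple[type, bool]:
    """Split a field spec into (declared_type, is_placeholder)."""
    if isinstance(spec, tuple) and len(spec) == 2 and spec[1] is placeholder:
        return spec[0], True
    return spec, False

def validate(fields: dict[str, Any], values: dict[str, Any]) -> None:
    """Raise if a required field is missing or a supplied value has the wrong type."""
    for name, spec in fields.items():
        declared_type, is_placeholder = parse_field(spec)
        if values.get(name) is None:
            if is_placeholder:
                continue  # placeholders may be omitted
            raise ValueError(f"missing required field: {name}")
        if not isinstance(values[name], declared_type):
            raise TypeError(f"{name} must be {declared_type.__name__}")

ARTICLE_FIELDS = {"title": str, "body": str, "sentiment": (str, placeholder)}

# Omitting the placeholder is accepted; supplying it triggers the type check.
validate(ARTICLE_FIELDS, {"title": "Hello", "body": "World"})
validate(ARTICLE_FIELDS, {"title": "Hello", "body": "World", "sentiment": "positive"})
```

Writing `{"sentiment": 3}` in this sketch raises `TypeError`, which mirrors the "type preserved" rule: the marker relaxes presence, not the type.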
The @enrichment decorator¶
Register a function that fills a specific placeholder field on a specific node type.
The decorator takes two required arguments:
| Argument | Description |
|---|---|
| target_type | The @node_type name (e.g., "Article") |
| field | The placeholder field name (e.g., "sentiment") |
And one optional argument:
| Argument | Description |
|---|---|
| depends_on | List of field names that must be filled first |
The decorated function receives two arguments:
| Argument | Description |
|---|---|
| node | The ORM instance (e.g., an Article with node.title, node.body, etc.) |
| ctx | The Trails context with ctx.llm, ctx.kg, ctx.cost |
The function returns the value to write into the placeholder field.
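To make the registration contract concrete, here is a minimal stand-in for a decorator with the same call shape. It is an illustration under assumptions, not the trails source: the `REGISTRY` dict, the `Registration` tuple, and the duplicate check are all hypothetical choices made for this sketch.

```python
from types import SimpleNamespace
from typing import Any, Callable, NamedTuple, Optional

class Registration(NamedTuple):
    func: Callable[[Any, Any], Any]
    depends_on: tuple[str, ...]

# Hypothetical registry keyed by (target_type, field).
REGISTRY: dict[tuple[str, str], Registration] = {}

def enrichment(target_type: str, field: str, *,
               depends_on: Optional[list[str]] = None):
    """Register the decorated function as the filler for target_type.field."""
    def decorator(func):
        key = (target_type, field)
        if key in REGISTRY:
            # Assumption for this sketch: one function per placeholder field.
            raise ValueError(f"duplicate enrichment for {target_type}.{field}")
        REGISTRY[key] = Registration(func, tuple(depends_on or ()))
        return func  # the function stays directly callable
    return decorator

@enrichment(target_type="Article", field="word_count")
def count_words(node, ctx):
    return len(node.body.split()) if node.body else 0

# A plain namespace stands in for the ORM instance.
article = SimpleNamespace(title="Hello", body="one two three")
fill = REGISTRY[("Article", "word_count")].func
print(fill(article, None))  # prints 3; ctx is unused by this function
```

Returning the original function from the decorator keeps enrichment functions easy to unit-test without a running pipeline.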
Non-LLM enrichment examples¶
Most enrichment does not need an LLM. Computation, regex, lookups, and derivation from existing fields are deterministic, free, and fast.
Computed from existing fields¶
@enrichment(target_type="Article", field="word_count")
def count_words(node, ctx):
    """Count words in the article body. Pure computation, zero cost."""
    if not node.body:
        return 0
    return len(node.body.split())
Regex extraction¶
import re

@enrichment(target_type="Article", field="first_url")
def extract_first_url(node, ctx):
    """Extract the first URL from the article body using regex."""
    match = re.search(r'https?://\S+', node.body or "")
    return match.group(0) if match else None
API lookup (no LLM)¶
import httpx

@enrichment(target_type="Article", field="language")
def detect_language(node, ctx):
    """Detect article language using a free API."""
    if not node.body:
        return None
    response = httpx.post(
        "https://libretranslate.de/detect",
        data={"q": node.body[:500]},
    )
    # Fail loudly on HTTP errors rather than parsing an error body
    response.raise_for_status()
    results = response.json()
    return results[0]["language"] if results else None
Derivation from other node types¶
Enrichment functions can read from any type via ctx.kg:
@enrichment(target_type="Article", field="author_bio")
def lookup_author_bio(node, ctx):
    """Fill author_bio by looking up the Author node."""
    authors = ctx.kg.match(type="Author", name=node.author_name)
    if authors:
        return authors[0].biography
    return None
Multi-field derivation with dependencies¶
Use depends_on to ensure fields are filled in the right order:
@enrichment(target_type="Article", field="quality_score",
            depends_on=["word_count", "sentiment"])
def compute_quality(node, ctx):
    """Derive quality score from other enriched fields."""
    score = 0.0
    if node.word_count and node.word_count > 200:
        score += 0.5
    if node.sentiment and node.sentiment != "negative":
        score += 0.3
    if node.title and len(node.title) > 10:
        score += 0.2
    return round(score, 2)
The scheduler resolves the dependency graph and executes enrichment functions in topological order.
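Topological ordering of depends_on declarations can be sketched with Python's standard-library `graphlib`. This only illustrates the ordering idea; how the trails scheduler is actually implemented is not stated here, and the `DEPENDS_ON` mapping and `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# field -> fields it depends on, mirroring the depends_on example above
DEPENDS_ON = {
    "word_count": [],
    "sentiment": [],
    "summary": [],
    "quality_score": ["word_count", "sentiment"],
}

def execution_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return fields ordered so that each one runs after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())

order = execution_order(DEPENDS_ON)
print(order)  # quality_score always appears after word_count and sentiment

# A circular declaration cannot be scheduled and is reported as an error.
try:
    execution_order({"a": ["b"], "b": ["a"]})
except CycleError as exc:
    print("cycle detected:", exc.args[1])
```

The relative order of independent fields (here word_count, sentiment, and summary) is not fixed by the dependency graph alone, so the sketch asserts nothing about it.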
LLM enrichment (optional)¶
For fields that require natural-language understanding, enrichment
functions can use ctx.llm. This is optional — it requires a configured
LLM provider and costs tokens.
@enrichment(target_type="Article", field="sentiment")
def analyse_sentiment(node, ctx):
    """Classify sentiment using LLM. Optional: requires LLM provider."""
    if not node.body:
        return None  # nothing to classify; avoids slicing None
    response = ctx.llm.complete(
        f"Classify the sentiment of this text as positive, negative, "
        f"or neutral. Reply with one word only.\n\n{node.body[:500]}",
        max_tokens=10,
    )
    return response.text.strip().lower()

@enrichment(target_type="Article", field="summary")
def summarise(node, ctx):
    """Generate a one-sentence summary using LLM."""
    if not node.body:
        return None  # nothing to summarise
    response = ctx.llm.complete(
        f"Summarise this article in one sentence:\n\n{node.body[:2000]}",
        max_tokens=100,
    )
    return response.text.strip()
LLM enrichment functions behave identically to non-LLM ones. The
@enrichment decorator does not care what the function does internally.
The distinction matters only for cost and determinism.
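Because model replies are not deterministic, a one-word answer can come back as "Positive." or as a full sentence. One possible guard, sketched here as a hypothetical helper (`normalise_sentiment` is not part of trails), maps the raw reply onto the three allowed labels. It assumes that returning None leaves the field unfilled, as the non-LLM examples above suggest.

```python
import re
from typing import Optional

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}

def normalise_sentiment(raw: str) -> Optional[str]:
    """Map a free-text model reply onto one allowed label, or None if unclear."""
    words = re.findall(r"[a-z]+", raw.lower())
    matches = {word for word in words if word in ALLOWED_SENTIMENTS}
    # Accept the reply only when exactly one label is mentioned.
    return matches.pop() if len(matches) == 1 else None

print(normalise_sentiment("Positive."))             # prints positive
print(normalise_sentiment("  NEUTRAL\n"))           # prints neutral
print(normalise_sentiment("positive or negative"))  # prints None
```

Inside an enrichment function, the last line would then read `return normalise_sentiment(response.text)` instead of returning the stripped text directly.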
run_enrichments() Python API¶
from trails.enrichment import run_enrichments
# Run all registered enrichments on all unfilled placeholders
result = run_enrichments(ctx)
print(f"Enriched {result.nodes_processed} nodes")
print(f"Fields filled: {result.fields_filled}")
print(f"Fields skipped: {result.fields_skipped}")
print(f"Total cost: ${result.total_cost_usd:.4f}")
# Run enrichments for a specific type
result = run_enrichments(ctx, target_type="Article")
# Run a specific field only
result = run_enrichments(ctx, target_type="Article", field="sentiment")
# Dry-run: see what would be enriched without executing
result = run_enrichments(ctx, dry_run=True)
for item in result.plan:
    print(f"  Would enrich {item.node_type}.{item.field} "
          f"on {item.unfilled_count} nodes")
# Force re-enrichment of already-filled fields
result = run_enrichments(ctx, force=True)
CLI usage¶
# Run all registered enrichments on unfilled placeholders
trails enrich run
# Run enrichments for a specific type
trails enrich run --type Article
# Run a specific field enrichment
trails enrich run --type Article --field sentiment
# Dry-run: show what would be enriched without executing
trails enrich run --dry-run
# Force re-enrichment of already-filled fields
trails enrich run --force
# Use Batches API for LLM enrichments (50% cost reduction)
trails enrich run --batch
# Show fill rates per placeholder field
trails enrich status
# List all registered enrichment functions
trails enrich list
trails enrich status¶
Shows the fill rate for every placeholder field across all node types:
Article:
sentiment: 23% filled (115 / 500)
quality_score: 0% filled (0 / 500)
summary: 89% filled (445 / 500)
word_count: 100% filled (500 / 500)
Author:
biography: 45% filled (90 / 200)
trails enrich list¶
Shows all registered enrichment functions:
Article.sentiment → analyse_sentiment (LLM)
Article.quality_score → compute_quality (computed, depends_on: [word_count, sentiment])
Article.summary → summarise (LLM)
Article.word_count → count_words (computed)
Author.biography → lookup_author_bio (lookup)
Cost tracking¶
Every trails enrich run invocation opens a CostTracker envelope
tagged enrich:<type>:<field> per ADR-0012. The CLI reports cost at
completion:
Enrichment complete.
Nodes processed: 500
Fields filled: 1385
Cost breakdown:
Article.word_count: $0.0000 (computed)
Article.sentiment: $0.0850 (LLM: haiku, 500 calls)
Article.summary: $0.4200 (LLM: haiku, 500 calls)
Article.quality_score: $0.0000 (computed)
Total cost: $0.5050
Non-LLM enrichment functions report $0.0000 — they are free.
When --batch is passed, LLM enrichment functions are grouped and sent
via the Batches API for a 50% discount.
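As a worked check of the numbers above, the following snippet re-adds the per-field costs from the example report and applies the stated 50% discount to the LLM fields. It assumes the discount applies uniformly to every LLM call in the run, which is a simplification; the variable names are illustrative only.

```python
from decimal import Decimal

# Per-field costs copied from the example report above (USD).
costs = {
    "Article.word_count": Decimal("0.0000"),
    "Article.sentiment": Decimal("0.0850"),
    "Article.summary": Decimal("0.4200"),
    "Article.quality_score": Decimal("0.0000"),
}
LLM_FIELDS = {"Article.sentiment", "Article.summary"}
BATCH_DISCOUNT = Decimal("0.5")  # the 50% discount stated above

total = sum(costs.values(), Decimal("0"))
batched = sum(
    (cost * BATCH_DISCOUNT if field in LLM_FIELDS else cost
     for field, cost in costs.items()),
    Decimal("0"),
)

print(f"Total cost:   ${total:.4f}")    # prints Total cost:   $0.5050
print(f"With --batch: ${batched:.4f}")  # prints With --batch: $0.2525
```

Computed fields contribute nothing in either case, so the whole saving comes from the two LLM fields.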
PROV-O provenance¶
Each enriched value emits a prov:Activity linking:
- The enrichment function (as prov:Agent)
- The filled value (as prov:Entity)
- The source node (as prov:Entity used by the activity)
The activity records the model used (if any), cost, and timestamp.
trails trace can answer "who filled this field and when?"
# Find provenance for a specific enriched value
SELECT ?activity ?agent ?timestamp ?cost
WHERE {
?activity prov:generated <article-123#sentiment> ;
prov:wasAssociatedWith ?agent ;
prov:startedAtTime ?timestamp .
OPTIONAL { ?activity trails:costUSD ?cost }
}
Execution semantics¶
- Idempotent. Running trails enrich run twice skips already-filled fields. Re-enrichment requires --force.
- Order-aware. The scheduler resolves depends_on declarations and executes in topological order. If quality_score depends on sentiment, sentiment runs first.
- Batch-aware. --batch groups LLM calls for the Batches API (50% discount). Functions are grouped by field.
- Cost-tracked. Every invocation opens a cost envelope. The budget is enforced if max_cost_per_run_usd is set in the active baseline.
- Non-LLM first. Wherever depends_on ordering allows, the scheduler runs deterministic enrichments (computed, regex, lookup) before LLM enrichments, maximizing fill rate per dollar.
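The idempotency and budget rules can be sketched as a small loop over plain dictionaries. This is an illustration under assumptions, not the trails runner: `enrich_field`, `RunStats`, and `BudgetExceeded` are hypothetical names, and raising an exception when the cap would be exceeded is only one possible way to enforce a budget.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

class BudgetExceeded(RuntimeError):
    """Raised when the next call would push spend past the cap (hypothetical)."""

@dataclass
class RunStats:
    fields_filled: int = 0
    fields_skipped: int = 0
    total_cost_usd: float = 0.0

def enrich_field(
    nodes: list[dict[str, Any]],
    field: str,
    func: Callable[[dict[str, Any]], Any],
    *,
    cost_per_call_usd: float = 0.0,
    force: bool = False,
    max_cost_usd: Optional[float] = None,
) -> RunStats:
    """Fill `field` on each node, skipping filled values unless force is set."""
    stats = RunStats()
    for node in nodes:
        if node.get(field) is not None and not force:
            stats.fields_skipped += 1  # idempotent: leave existing values alone
            continue
        next_total = stats.total_cost_usd + cost_per_call_usd
        if max_cost_usd is not None and next_total > max_cost_usd:
            raise BudgetExceeded(f"cap of ${max_cost_usd:.4f} would be exceeded")
        value = func(node)
        stats.total_cost_usd = next_total
        if value is not None:
            node[field] = value
            stats.fields_filled += 1
    return stats

nodes = [{"body": "a b c", "word_count": None}, {"body": "d e", "word_count": 7}]
count = lambda node: len(node["body"].split())

first = enrich_field(nodes, "word_count", count)    # fills 1, skips 1
second = enrich_field(nodes, "word_count", count)   # fills 0, skips 2
forced = enrich_field(nodes, "word_count", count, force=True)  # refills both
```

After the forced run, the second node's stale value of 7 is replaced by the recomputed count of 2, which is the behaviour --force is described as providing.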
Reference¶
| Symbol | Description |
|---|---|
| @enrichment(target_type, field, *, depends_on) | Register a function as an enrichment for a placeholder field |
| run_enrichments(ctx, *, target_type, field, dry_run, force) | Execute registered enrichments; returns EnrichmentResult |
| EnrichmentResult | .nodes_processed, .fields_filled, .fields_skipped, .total_cost_usd, .plan |
| placeholder | Marker for @node_type field definitions: (str, placeholder) |
See also¶
- ADR-0026: full transformation and enrichment design
- Schema Transformation: reshape schemas and create placeholder fields
- Baseline Configurations: declare enrichment pipeline requirements in baselines
- ActiveGraph ORM: @node_type and placeholder marker
- LLM Client & Session: ctx.llm used by LLM enrichments