Enrichment Pipeline¶

Schema transformation reshapes data from one schema to another, but target fields that have no source equivalent remain empty. The enrichment pipeline fills those gaps: deterministically where possible, with LLM assistance where needed, and always with cost tracking and provenance. LLM enrichment is optional, and everything else works without API keys. The full design lives in ADR-0026; cost tracking follows ADR-0012.

Overview¶

The enrichment pipeline connects three concepts:

Placeholder fields on @node_type — typed gaps in the schema that declare "this field exists but is not yet populated."
Enrichment functions — Python functions registered via @enrichment that know how to fill a specific placeholder.
Batch execution — trails enrich run iterates unfilled placeholders and calls the registered functions, with cost tracking and PROV-O provenance on every write.

placeholder fields → registered @enrichment functions → trails enrich run → filled data + provenance

Placeholder fields on `@node_type`¶

A placeholder marks a field as "exists in the schema, not yet filled." Use the placeholder marker in the field definition:

from trails.orm import node_type, placeholder

@node_type("Article", fields={
    "title": str,
    "body": str,
    "sentiment": (str, placeholder),       # not yet filled
    "quality_score": (float, placeholder),  # not yet filled
    "summary": (str, placeholder),          # not yet filled
})
class Article:
    """An article with enrichment placeholders."""
    pass

Placeholder semantics:

Nullable by default. Writes that omit placeholder fields succeed without validation errors.
SHACL export emits sh:minCount 0 for placeholder fields regardless of the type's normal cardinality rules. A str placeholder does not generate sh:minCount 1.
Type preserved. The placeholder marker is metadata, not a type. The field's actual type (str, float, etc.) is preserved — when a value IS written, it is validated against the declared type.
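
The three rules above can be illustrated with a small stand-alone sketch. It does not use Trails itself: the `placeholder` sentinel, `split_spec`, and `validate` below are hypothetical stand-ins, and the rule that ordinary (non-placeholder) fields are required is an assumption inferred from the SHACL note.

```python
from typing import Any


class _Placeholder:
    """Hypothetical sentinel marking a field as an enrichment placeholder."""

    def __repr__(self) -> str:
        return "placeholder"


placeholder = _Placeholder()  # stand-in for trails.orm.placeholder

# Field specs in the same shape as the @node_type example above.
ARTICLE_FIELDS = {
    "title": str,
    "body": str,
    "sentiment": (str, placeholder),
    "quality_score": (float, placeholder),
}


def split_spec(spec: Any) -> tuple[type, bool]:
    """Return (declared_type, is_placeholder) for one field spec."""
    if isinstance(spec, tuple) and len(spec) == 2 and spec[1] is placeholder:
        return spec[0], True  # the marker is metadata, the type is kept
    return spec, False


def validate(fields: dict[str, Any], data: dict[str, Any]) -> list[str]:
    """Collect validation errors for one write; an empty list means valid."""
    errors = []
    for name, spec in fields.items():
        declared_type, is_placeholder = split_spec(spec)
        if data.get(name) is None:
            # Placeholders may be omitted; ordinary fields may not (assumed).
            if not is_placeholder:
                errors.append(f"{name}: required")
        elif not isinstance(data[name], declared_type):
            # A written value is always checked against the declared type.
            errors.append(f"{name}: expected {declared_type.__name__}")
    return errors


print(validate(ARTICLE_FIELDS, {"title": "T", "body": "B"}))
print(validate(ARTICLE_FIELDS, {"title": "T", "body": "B", "sentiment": 3}))
```

The first call passes because both omitted fields are placeholders; the second fails because a value was written to `sentiment` and it is not a `str`.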

trails doctor reports fill rates:

Article:
  sentiment:      23% filled (115 / 500)
  quality_score:   0% filled (0 / 500)
  summary:        89% filled (445 / 500)

Placeholder fields are typically created by trails onto transform --enrich (see transformation guide), but you can also declare them manually in any @node_type.

The `@enrichment` decorator¶

Register a function that fills a specific placeholder field on a specific node type:

from trails.enrichment import enrichment

The decorator takes two required arguments:

Argument	Description
`target_type`	The `@node_type` name (e.g., `"Article"`)
`field`	The placeholder field name (e.g., `"sentiment"`)

And one optional argument:

Argument	Description
`depends_on`	List of field names that must be filled first

The decorated function receives two arguments:

Argument	Description
`node`	The ORM instance (e.g., an `Article` with `node.title`, `node.body`, etc.)
`ctx`	The Trails context with `ctx.llm`, `ctx.kg`, `ctx.cost`

The function returns the value to write into the placeholder field.

Non-LLM enrichment examples¶

Most enrichment does not need an LLM. Computation, regex, lookups, and derivation from existing fields are deterministic, free, and fast.

Computed from existing fields¶

@enrichment(target_type="Article", field="word_count")
def count_words(node, ctx):
    """Count words in the article body. Pure computation, zero cost."""
    if not node.body:
        return 0
    return len(node.body.split())

Regex extraction¶

import re

@enrichment(target_type="Article", field="first_url")
def extract_first_url(node, ctx):
    """Extract the first URL from the article body using regex."""
    match = re.search(r'https?://\S+', node.body or "")
    return match.group(0) if match else None

API lookup (no LLM)¶

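import httpx

@enrichment(target_type="Article", field="language")
def detect_language(node, ctx):
    """Detect article language using a free API."""
    if not node.body:
        return None
    try:
        response = httpx.post(
            "https://libretranslate.de/detect",
            data={"q": node.body[:500]},  # send only the first 500 characters
            timeout=10.0,
        )
        response.raise_for_status()
    except httpx.HTTPError:
        # Network or HTTP failure: return None, as the other examples do
        # when no value can be derived.
        return None
    results = response.json()
    return results[0]["language"] if results else None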

Derivation from other node types¶

Enrichment functions can read from any type via ctx.kg:

@enrichment(target_type="Article", field="author_bio")
def lookup_author_bio(node, ctx):
    """Fill author_bio by looking up the Author node."""
    authors = ctx.kg.match(type="Author", name=node.author_name)
    if authors:
        return authors[0].biography
    return None

Multi-field derivation with dependencies¶

Use depends_on to ensure fields are filled in the right order:

@enrichment(target_type="Article", field="quality_score",
            depends_on=["word_count", "sentiment"])
def compute_quality(node, ctx):
    """Derive quality score from other enriched fields."""
    score = 0.0
    if node.word_count and node.word_count > 200:
        score += 0.5
    if node.sentiment and node.sentiment != "negative":
        score += 0.3
    if node.title and len(node.title) > 10:
        score += 0.2
    return round(score, 2)

The scheduler resolves the dependency graph and executes enrichment functions in topological order.
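
Topological ordering can be sketched with the standard library's `graphlib`. This is an illustration of the technique, not the Trails scheduler; `DEPENDS_ON` and `execution_order` are hypothetical names, and turning a cycle into a `ValueError` is an assumed policy.

```python
from graphlib import CycleError, TopologicalSorter

# field -> fields it depends on, mirroring the depends_on declarations above.
DEPENDS_ON = {
    "word_count": [],
    "sentiment": [],
    "summary": [],
    "quality_score": ["word_count", "sentiment"],
}


def execution_order(deps: dict[str, list[str]]) -> list[str]:
    """Return fields ordered so that every dependency comes first."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular depends_on: {exc.args[1]}") from exc


order = execution_order(DEPENDS_ON)
print(order)
```

The relative order of independent fields is not significant; the only guarantee is that `word_count` and `sentiment` appear before `quality_score`.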

LLM enrichment (optional)¶

For fields that require natural-language understanding, enrichment functions can use ctx.llm. This is optional — it requires a configured LLM provider and costs tokens.

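@enrichment(target_type="Article", field="sentiment")
def analyse_sentiment(node, ctx):
    """Classify sentiment using LLM. Optional: requires an LLM provider."""
    if not node.body:
        return None  # nothing to classify; also avoids slicing None
    response = ctx.llm.complete(
        f"Classify the sentiment of this text as positive, negative, "
        f"or neutral. Reply with one word only.\n\n{node.body[:500]}",
        max_tokens=10,
    )
    # Normalise replies such as "Positive." and reject anything unexpected.
    label = response.text.strip().lower().rstrip(".")
    return label if label in {"positive", "negative", "neutral"} else None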

@enrichment(target_type="Article", field="summary")
def summarise(node, ctx):
    """Generate a one-sentence summary using LLM."""
    if not node.body:
        return None  # nothing to summarise; also avoids slicing None
    response = ctx.llm.complete(
        f"Summarise this article in one sentence:\n\n{node.body[:2000]}",
        max_tokens=100,
    )
    return response.text.strip()

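LLM enrichment functions are registered and invoked exactly like non-LLM ones; the @enrichment decorator does not care what the function does internally. The distinction matters for cost, determinism, and scheduling: deterministic enrichments run first, and --batch applies only to LLM calls.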

`run_enrichments()` Python API¶

from trails.enrichment import run_enrichments

# Run all registered enrichments on all unfilled placeholders
result = run_enrichments(ctx)

print(f"Enriched {result.nodes_processed} nodes")
print(f"Fields filled: {result.fields_filled}")
print(f"Fields skipped: {result.fields_skipped}")
print(f"Total cost: ${result.total_cost_usd:.4f}")

# Run enrichments for a specific type
result = run_enrichments(ctx, target_type="Article")

# Run a specific field only
result = run_enrichments(ctx, target_type="Article", field="sentiment")

# Dry-run: see what would be enriched without executing
result = run_enrichments(ctx, dry_run=True)
for item in result.plan:
    print(f"  Would enrich {item.node_type}.{item.field} "
          f"on {item.unfilled_count} nodes")

# Force re-enrichment of already-filled fields
result = run_enrichments(ctx, force=True)

CLI usage¶

# Run all registered enrichments on unfilled placeholders
trails enrich run

# Run enrichments for a specific type
trails enrich run --type Article

# Run a specific field enrichment
trails enrich run --type Article --field sentiment

# Dry-run: show what would be enriched without executing
trails enrich run --dry-run

# Force re-enrichment of already-filled fields
trails enrich run --force

# Use Batches API for LLM enrichments (50% cost reduction)
trails enrich run --batch

# Show fill rates per placeholder field
trails enrich status

# List all registered enrichment functions
trails enrich list

`trails enrich status`¶

Shows the fill rate for every placeholder field across all node types:

Article:
  sentiment:      23% filled (115 / 500)
  quality_score:   0% filled (0 / 500)
  summary:        89% filled (445 / 500)
  word_count:    100% filled (500 / 500)

Author:
  biography:      45% filled (90 / 200)
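
The percentages in this report are simple ratios of filled to total nodes. A minimal sketch, assuming nodes are plain dicts and that a field counts as filled when its value is not None; `fill_rate` and `format_line` are illustrative names, not Trails functions:

```python
def fill_rate(nodes: list[dict], field: str) -> tuple[int, int, int]:
    """Return (filled, total, percent) for one placeholder field."""
    total = len(nodes)
    filled = sum(1 for node in nodes if node.get(field) is not None)
    percent = round(100 * filled / total) if total else 0
    return filled, total, percent


def format_line(field: str, filled: int, total: int, percent: int) -> str:
    """Format one row in the layout shown above."""
    return f"  {field + ':':<15}{percent:>3}% filled ({filled} / {total})"


# 115 of 500 articles carry a sentiment value.
articles = [{"sentiment": "positive"}] * 115 + [{"sentiment": None}] * 385
print(format_line("sentiment", *fill_rate(articles, "sentiment")))
```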

`trails enrich list`¶

Shows all registered enrichment functions:

Article.sentiment     → analyse_sentiment    (LLM)
Article.quality_score → compute_quality      (computed, depends_on: [word_count, sentiment])
Article.summary       → summarise            (LLM)
Article.word_count    → count_words          (computed)
Author.biography      → lookup_author_bio    (lookup)

Cost tracking¶

Every trails enrich run invocation opens a CostTracker envelope tagged enrich:<type>:<field> per ADR-0012. The CLI reports cost at completion:

Enrichment complete.
  Nodes processed: 500
  Fields filled:   1385
  Cost breakdown:
    Article.word_count:    $0.0000 (computed)
    Article.sentiment:     $0.0850 (LLM: haiku, 500 calls)
    Article.summary:       $0.4200 (LLM: haiku, 500 calls)
    Article.quality_score: $0.0000 (computed)
  Total cost: $0.5050

Non-LLM enrichment functions report $0.0000 — they are free.

When --batch is passed, LLM enrichment functions are grouped and sent via the Batches API for a 50% discount.
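
As a worked check of the sample report, the total is the sum of the per-field costs, and the stated 50% discount would halve only the LLM entries. The sketch below uses illustrative names (`COSTS`, `LLM_FIELDS`, `total_cost`); the discounted figure is derived arithmetic, not output produced by Trails.

```python
from decimal import Decimal

# Per-field costs copied from the sample report above.
COSTS = {
    "Article.word_count": Decimal("0.0000"),
    "Article.sentiment": Decimal("0.0850"),
    "Article.summary": Decimal("0.4200"),
    "Article.quality_score": Decimal("0.0000"),
}
LLM_FIELDS = {"Article.sentiment", "Article.summary"}
BATCH_DISCOUNT = Decimal("0.50")  # the 50% discount stated for --batch


def total_cost(costs: dict[str, Decimal], batch: bool = False) -> Decimal:
    """Sum per-field costs, discounting LLM fields when batch is set."""
    total = Decimal("0")
    for field, cost in costs.items():
        if batch and field in LLM_FIELDS:
            cost = cost * (1 - BATCH_DISCOUNT)
        total += cost
    return total


print(f"Without --batch: ${total_cost(COSTS):.4f}")
print(f"With --batch:    ${total_cost(COSTS, batch=True):.4f}")
```

`Decimal` is used so that the sums match the four-decimal figures exactly instead of accumulating binary floating-point error.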

PROV-O provenance¶

Each enriched value emits a prov:Activity linking:

The enrichment function (as prov:Agent)
The filled value (as prov:Entity)
The source node (as prov:Entity used by the activity)

The activity records the model used (if any), cost, and timestamp. trails trace can answer "who filled this field and when?"

# Find provenance for a specific enriched value.
# The trails: prefix must also be declared, using your deployment's namespace.
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?activity ?agent ?timestamp ?cost
WHERE {
    ?activity prov:generated <article-123#sentiment> ;
              prov:wasAssociatedWith ?agent ;
              prov:startedAtTime ?timestamp .
    OPTIONAL { ?activity trails:costUSD ?cost }
}
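
The shape of the record that such a query matches can be sketched as plain triples. The PROV-O terms are standard; the `http://example.org/` IRIs and the `provenance_triples` helper are hypothetical, and the cost triple is omitted because the `trails:` namespace is not given in this guide.

```python
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical base IRI for the illustration


def provenance_triples(activity: str, agent: str, value: str, source: str,
                       started_at: datetime) -> list[tuple[str, str, str]]:
    """Build the triples linking one enrichment activity to its parts."""
    return [
        (activity, RDF_TYPE, PROV + "Activity"),
        (activity, PROV + "wasAssociatedWith", agent),  # enrichment function
        (activity, PROV + "generated", value),          # the filled value
        (activity, PROV + "used", source),              # the source node
        (activity, PROV + "startedAtTime", started_at.isoformat()),
    ]


triples = provenance_triples(
    activity=EX + "activity/1",
    agent=EX + "agent/analyse_sentiment",
    value=EX + "article-123#sentiment",
    source=EX + "article-123",
    started_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
for triple in triples:
    print(triple)
```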

Execution semantics¶

Idempotent. Running trails enrich run twice skips already-filled fields. Re-enrichment requires --force.
Order-aware. The scheduler resolves depends_on declarations and executes in topological order. If quality_score depends on sentiment, sentiment runs first.
Batch-aware. --batch groups LLM calls for the Batches API (50% discount). Functions are grouped by field.
Cost-tracked. Every invocation opens a cost envelope. The budget is enforced if max_cost_per_run_usd is set in the active baseline.
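Non-LLM first. Where the dependency graph allows, the scheduler runs deterministic enrichments (computed, regex, lookup) before LLM enrichments, maximizing fill rate per dollar. A depends_on edge takes precedence: quality_score is computed, but it still waits for the LLM-filled sentiment.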
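
The idempotency, force, and budget rules can be pictured as a single loop. This is a sketch under stated assumptions, not the Trails runner: nodes are plain dicts, `run_field` is a hypothetical helper, each call has a fixed cost, and the budget is assumed to stop the run before a call that would exceed it.

```python
from typing import Callable, Optional


def run_field(nodes: list[dict], field: str, func: Callable[[dict], object], *,
              cost_per_call: float = 0.0, force: bool = False,
              max_cost_usd: Optional[float] = None) -> tuple[int, int, float]:
    """Fill one field across nodes; return (filled, skipped, spent)."""
    filled = skipped = 0
    spent = 0.0
    for node in nodes:
        if node.get(field) is not None and not force:
            skipped += 1  # idempotent: already-filled values are left alone
            continue
        if max_cost_usd is not None and spent + cost_per_call > max_cost_usd:
            break  # assumed policy: stop before exceeding the budget
        node[field] = func(node)
        spent += cost_per_call
        filled += 1
    return filled, skipped, spent


def word_count(node: dict) -> int:
    return len((node.get("body") or "").split())


articles = [{"body": "a b"}, {"body": "c", "word_count": 1}]
print(run_field(articles, "word_count", word_count))              # first run
print(run_field(articles, "word_count", word_count))              # second run
print(run_field(articles, "word_count", word_count, force=True))  # forced
```

The second run fills nothing because every value is already present, which is the behaviour the Idempotent rule describes; only `force=True` rewrites existing values.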

Reference¶

Symbol	Description
`@enrichment(target_type, field, *, depends_on)`	Register a function as an enrichment for a placeholder field
`run_enrichments(ctx, *, target_type, field, dry_run, force)`	Execute registered enrichments; returns `EnrichmentResult`
`EnrichmentResult`	`.nodes_processed`, `.fields_filled`, `.fields_skipped`, `.total_cost_usd`, `.plan`
`placeholder`	Marker for `@node_type` field definitions: `(str, placeholder)`