Enrichment Pipeline

Schema transformation reshapes data from one schema to another, but target fields that have no source equivalent remain empty. The enrichment pipeline fills those gaps — deterministically where possible, with LLM assistance where needed, always with cost tracking and provenance. Every feature works without API keys; LLM enrichment is optional. The full design lives in ADR-0026; cost tracking follows ADR-0012.

Overview

The enrichment pipeline connects three concepts:

  1. Placeholder fields on @node_type: typed gaps in the schema that declare "this field exists but is not yet populated."
  2. Enrichment functions: Python functions registered via @enrichment that know how to fill a specific placeholder.
  3. Batch execution: trails enrich run iterates over unfilled placeholders and calls the registered functions, with cost tracking and PROV-O provenance on every write.

placeholder fields → registered @enrichment functions → trails enrich run → filled data + provenance

Placeholder fields on @node_type

A placeholder marks a field as "exists in the schema, not yet filled." Use the placeholder marker in the field definition:

from trails.orm import node_type, placeholder

@node_type("Article", fields={
    "title": str,
    "body": str,
    "sentiment": (str, placeholder),       # not yet filled
    "quality_score": (float, placeholder),  # not yet filled
    "summary": (str, placeholder),          # not yet filled
})
class Article:
    """An article with enrichment placeholders."""
    pass

Placeholder semantics:

  • Nullable by default. Writes that omit placeholder fields succeed without validation errors.
  • SHACL export emits sh:minCount 0 for placeholder fields regardless of the type's normal cardinality rules. A str placeholder does not generate sh:minCount 1.
  • Type preserved. The placeholder marker is metadata, not a type. The field's actual type (str, float, etc.) is preserved — when a value IS written, it is validated against the declared type.
  • trails doctor reports fill rates:
    Article:
      sentiment:      23% filled (115 / 500)
      quality_score:   0% filled (0 / 500)
      summary:        89% filled (445 / 500)
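
The semantics above can be modelled in a few lines of plain Python. This is a minimal sketch, not the trails implementation: PLACEHOLDER, split_spec, validate_write, and fill_rate are hypothetical stand-ins, nodes are plain dicts, and non-placeholder fields are assumed to be required.

```python
# Hypothetical stand-in for the placeholder marker; not the trails.orm object.
PLACEHOLDER = object()

# Field specs in the same shape as the @node_type example: type or (type, marker).
ARTICLE_FIELDS = {
    "title": str,
    "body": str,
    "sentiment": (str, PLACEHOLDER),
    "quality_score": (float, PLACEHOLDER),
}


def split_spec(spec):
    """Return (declared_type, is_placeholder) for one field spec."""
    if isinstance(spec, tuple):
        return spec[0], PLACEHOLDER in spec[1:]
    return spec, False


def validate_write(fields, values):
    """Let placeholders be omitted, but type-check every value that is written."""
    for name, spec in fields.items():
        declared, is_placeholder = split_spec(spec)
        value = values.get(name)
        if value is None:
            # Assumption: only placeholder fields may be left empty.
            if not is_placeholder:
                raise ValueError(f"missing required field: {name}")
            continue
        if not isinstance(value, declared):
            raise TypeError(f"{name} expects {declared.__name__}")


def fill_rate(nodes, field):
    """Return (filled, total) for one placeholder field."""
    filled = sum(1 for node in nodes if node.get(field) is not None)
    return filled, len(nodes)
```

A write that omits sentiment passes validation, while writing a string into quality_score raises TypeError, which mirrors the "type preserved" rule.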
    

Placeholder fields are typically created by trails onto transform --enrich (see transformation guide), but you can also declare them manually in any @node_type.

The @enrichment decorator

Register a function that fills a specific placeholder field on a specific node type:

from trails.enrichment import enrichment

The decorator takes two required arguments:

Argument      Description
target_type   The @node_type name (e.g., "Article")
field         The placeholder field name (e.g., "sentiment")

And one optional argument:

Argument      Description
depends_on    List of field names that must be filled first

The decorated function receives two arguments:

Argument      Description
node          The ORM instance (e.g., an Article with node.title, node.body, etc.)
ctx           The Trails context with ctx.llm, ctx.kg, ctx.cost

The function returns the value to write into the placeholder field.
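
To make the registration step concrete, here is a minimal sketch of how a decorator with this signature could record functions in a lookup table. REGISTRY and the duplicate check are assumptions made for illustration; the real trails.enrichment internals may differ, and the node is a plain dict rather than an ORM instance.

```python
# Hypothetical registry keyed by (target_type, field); not the trails internals.
REGISTRY = {}


def enrichment(target_type, field, *, depends_on=None):
    """Register the decorated function under (target_type, field)."""
    def decorator(func):
        key = (target_type, field)
        # Assumption: one function per placeholder field.
        if key in REGISTRY:
            raise ValueError(f"duplicate enrichment for {target_type}.{field}")
        REGISTRY[key] = {"func": func, "depends_on": list(depends_on or [])}
        return func  # the function itself is returned unchanged
    return decorator


@enrichment(target_type="Article", field="word_count")
def count_words(node, ctx):
    """Count whitespace-separated words in the body of a dict-shaped node."""
    return len((node.get("body") or "").split())
```

Because the decorator returns the function unchanged, it stays directly callable in unit tests without going through a scheduler.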

Non-LLM enrichment examples

Most enrichment does not need an LLM. Computation, regex, lookups, and derivation from existing fields are deterministic, free, and fast.

Computed from existing fields

@enrichment(target_type="Article", field="word_count")
def count_words(node, ctx):
    """Count words in the article body. Pure computation, zero cost."""
    if not node.body:
        return 0
    return len(node.body.split())

Regex extraction

import re

@enrichment(target_type="Article", field="first_url")
def extract_first_url(node, ctx):
    """Extract the first URL from the article body using regex."""
    match = re.search(r'https?://\S+', node.body or "")
    return match.group(0) if match else None

API lookup (no LLM)

import httpx

@enrichment(target_type="Article", field="language")
def detect_language(node, ctx):
    """Detect article language using a free API."""
    if not node.body:
        return None
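    response = httpx.post(
        "https://libretranslate.de/detect",
        data={"q": node.body[:500]},
        timeout=10.0,  # do not let one slow request stall the whole run
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    results = response.json()
    return results[0]["language"] if results else None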

Derivation from other node types

Enrichment functions can read from any type via ctx.kg:

@enrichment(target_type="Article", field="author_bio")
def lookup_author_bio(node, ctx):
    """Fill author_bio by looking up the Author node."""
    authors = ctx.kg.match(type="Author", name=node.author_name)
    if authors:
        return authors[0].biography
    return None

Multi-field derivation with dependencies

Use depends_on to ensure fields are filled in the right order:

@enrichment(target_type="Article", field="quality_score",
            depends_on=["word_count", "sentiment"])
def compute_quality(node, ctx):
    """Derive quality score from other enriched fields."""
    score = 0.0
    if node.word_count and node.word_count > 200:
        score += 0.5
    if node.sentiment and node.sentiment != "negative":
        score += 0.3
    if node.title and len(node.title) > 10:
        score += 0.2
    return round(score, 2)

The scheduler resolves the dependency graph and executes enrichment functions in topological order.
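
The ordering step can be illustrated with the standard library's graphlib module. This is a sketch of topological ordering in general, not the trails scheduler itself; DEPENDS_ON and execution_order are hypothetical names, and the mapping mirrors the depends_on declarations from the examples above.

```python
from graphlib import TopologicalSorter

# field -> fields it depends on, mirroring the depends_on declarations above
DEPENDS_ON = {
    "word_count": [],
    "sentiment": [],
    "summary": [],
    "quality_score": ["word_count", "sentiment"],
}


def execution_order(depends_on):
    """Return fields ordered so that every dependency precedes its dependents."""
    # TopologicalSorter raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
print(order)
```

Any order this produces places word_count and sentiment before quality_score; the relative order of independent fields is not significant.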

LLM enrichment (optional)

For fields that require natural-language understanding, enrichment functions can use ctx.llm. This is optional — it requires a configured LLM provider and costs tokens.

@enrichment(target_type="Article", field="sentiment")
def analyse_sentiment(node, ctx):
    """Classify sentiment using LLM. Optional: requires an LLM provider."""
    if not node.body:
        return None  # no text to classify, so skip the LLM call
    response = ctx.llm.complete(
        f"Classify the sentiment of this text as positive, negative, "
        f"or neutral. Reply with one word only.\n\n{node.body[:500]}",
        max_tokens=10,
    )
    return response.text.strip().lower()

@enrichment(target_type="Article", field="summary")
def summarise(node, ctx):
    """Generate a one-sentence summary using LLM."""
    if not node.body:
        return None  # no text to summarise, so skip the LLM call
    response = ctx.llm.complete(
        f"Summarise this article in one sentence:\n\n{node.body[:2000]}",
        max_tokens=100,
    )
    return response.text.strip()

LLM enrichment functions are registered and invoked exactly like non-LLM ones: the @enrichment decorator does not care what the function does internally. The distinction matters for cost, determinism, and scheduling order, since deterministic enrichments run first.
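
Model replies are not guaranteed to be a single lowercase word, so a classification function may want to normalise the reply before writing it. The helper below is a hypothetical sketch (normalise_sentiment and ALLOWED_SENTIMENTS are not part of trails) and assumes that returning None leaves the placeholder unfilled, as the non-LLM examples suggest.

```python
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def normalise_sentiment(raw):
    """Map a raw model reply to an allowed label, or None if it does not match."""
    if not raw:
        return None
    words = raw.strip().lower().split()
    if not words:
        return None
    # Keep the first word and drop surrounding punctuation such as "Positive."
    label = words[0].strip(".,;:!\"'")
    return label if label in ALLOWED_SENTIMENTS else None
```

In analyse_sentiment this would wrap the final line, for example return normalise_sentiment(response.text), so that unexpected replies are not stored as sentiment values.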

run_enrichments() Python API

from trails.enrichment import run_enrichments

# Run all registered enrichments on all unfilled placeholders
result = run_enrichments(ctx)

print(f"Enriched {result.nodes_processed} nodes")
print(f"Fields filled: {result.fields_filled}")
print(f"Fields skipped: {result.fields_skipped}")
print(f"Total cost: ${result.total_cost_usd:.4f}")

# Run enrichments for a specific type
result = run_enrichments(ctx, target_type="Article")

# Run a specific field only
result = run_enrichments(ctx, target_type="Article", field="sentiment")

# Dry-run: see what would be enriched without executing
result = run_enrichments(ctx, dry_run=True)
for item in result.plan:
    print(f"  Would enrich {item.node_type}.{item.field} "
          f"on {item.unfilled_count} nodes")

# Force re-enrichment of already-filled fields
result = run_enrichments(ctx, force=True)

CLI usage

# Run all registered enrichments on unfilled placeholders
trails enrich run

# Run enrichments for a specific type
trails enrich run --type Article

# Run a specific field enrichment
trails enrich run --type Article --field sentiment

# Dry-run: show what would be enriched without executing
trails enrich run --dry-run

# Force re-enrichment of already-filled fields
trails enrich run --force

# Use Batches API for LLM enrichments (50% cost reduction)
trails enrich run --batch

# Show fill rates per placeholder field
trails enrich status

# List all registered enrichment functions
trails enrich list

trails enrich status

Shows the fill rate for every placeholder field across all node types:

Article:
  sentiment:      23% filled (115 / 500)
  quality_score:   0% filled (0 / 500)
  summary:        89% filled (445 / 500)
  word_count:    100% filled (500 / 500)

Author:
  biography:      45% filled (90 / 200)

trails enrich list

Shows all registered enrichment functions:

Article.sentiment     → analyse_sentiment    (LLM)
Article.quality_score → compute_quality      (computed, depends_on: [word_count, sentiment])
Article.summary       → summarise            (LLM)
Article.word_count    → count_words          (computed)
Article.author_bio    → lookup_author_bio    (lookup)

Cost tracking

Every trails enrich run invocation opens a CostTracker envelope tagged enrich:<type>:<field> per ADR-0012. The CLI reports cost at completion:

Enrichment complete.
  Nodes processed: 500
  Fields filled:   1385
  Cost breakdown:
    Article.word_count:    $0.0000 (computed)
    Article.sentiment:     $0.0850 (LLM: haiku, 500 calls)
    Article.summary:       $0.4200 (LLM: haiku, 500 calls)
    Article.quality_score: $0.0000 (computed)
  Total cost: $0.5050

Non-LLM enrichment functions report $0.0000 — they are free.

When --batch is passed, LLM enrichment functions are grouped and sent via the Batches API for a 50% discount.
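
The arithmetic behind the report can be checked by hand. The sketch below re-adds the sample breakdown with Decimal to avoid float rounding, and assumes the 50% batch discount applies uniformly to every LLM line and to nothing else; BREAKDOWN and total_cost are hypothetical names, not part of the CostTracker API.

```python
from decimal import Decimal

# Per-field costs from the sample report above; True marks LLM-backed fields.
BREAKDOWN = {
    "Article.word_count": (Decimal("0.0000"), False),
    "Article.sentiment": (Decimal("0.0850"), True),
    "Article.summary": (Decimal("0.4200"), True),
    "Article.quality_score": (Decimal("0.0000"), False),
}


def total_cost(breakdown, batch=False):
    """Sum per-field costs; with batch=True, halve the LLM lines only."""
    total = Decimal("0")
    for cost, is_llm in breakdown.values():
        total += cost / 2 if (batch and is_llm) else cost
    return total


print(total_cost(BREAKDOWN))
print(total_cost(BREAKDOWN, batch=True))
```

For the sample run this gives $0.5050 without batching and, under the stated assumption, $0.2525 with --batch.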

PROV-O provenance

Each enriched value emits a prov:Activity linking:

  • The enrichment function (as prov:Agent)
  • The filled value (as prov:Entity)
  • The source node (as prov:Entity used by the activity)

The activity records the model used (if any), cost, and timestamp. trails trace can answer "who filled this field and when?"

# Find provenance for a specific enriched value
SELECT ?activity ?agent ?timestamp ?cost
WHERE {
    ?activity prov:generated <article-123#sentiment> ;
              prov:wasAssociatedWith ?agent ;
              prov:startedAtTime ?timestamp .
    OPTIONAL { ?activity trails:costUSD ?cost }
}

Execution semantics

  • Idempotent. Running trails enrich run twice skips already-filled fields. Re-enrichment requires --force.
  • Order-aware. The scheduler resolves depends_on declarations and executes in topological order. If quality_score depends on sentiment, sentiment runs first.
  • Batch-aware. --batch groups LLM calls for the Batches API (50% discount). Functions are grouped by field.
  • Cost-tracked. Every invocation opens a cost envelope. The budget is enforced if max_cost_per_run_usd is set in the active baseline.
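  • Non-LLM first. Where depends_on ordering allows, the scheduler runs deterministic enrichments (computed, regex, lookup) before LLM enrichments, maximizing fill rate per dollar. A computed field that depends on an LLM-filled field, such as quality_score on sentiment, still waits for that field.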
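
The idempotence, --force, and budget rules above can be sketched as a small loop over dict-shaped nodes. This is an illustration under stated assumptions, not the trails runner: run and BudgetExceeded are hypothetical, each call is given a flat cost, and exceeding the budget is assumed to abort the run rather than skip the remaining nodes.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next call would exceed the per-run budget."""


def run(nodes, field, func, *, cost_per_call=0.0, force=False, max_cost=None):
    """Fill `field` on each node dict; return (filled, skipped, spent)."""
    filled = skipped = 0
    spent = 0.0
    for node in nodes:
        if node.get(field) is not None and not force:
            skipped += 1  # idempotent: already-filled fields are left alone
            continue
        # Assumption: the budget is checked before each call, and a breach aborts.
        if max_cost is not None and spent + cost_per_call > max_cost:
            raise BudgetExceeded(f"budget {max_cost} reached after {filled} writes")
        node[field] = func(node)
        spent += cost_per_call
        filled += 1
    return filled, skipped, spent
```

Calling run twice on the same nodes fills nothing the second time; passing force=True rewrites every node regardless of its current value.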

Reference

Symbol                                                        Description
@enrichment(target_type, field, *, depends_on)                Register a function as an enrichment for a placeholder field
run_enrichments(ctx, *, target_type, field, dry_run, force)   Execute registered enrichments; returns EnrichmentResult
EnrichmentResult                                              .nodes_processed, .fields_filled, .fields_skipped, .total_cost_usd, .plan
placeholder                                                   Marker for @node_type field definitions: (str, placeholder)

See also