
Chapter 11 — Schema Transformation, Enrichment, and Baselines

Your knowledge graph is never static. Legacy systems retire, standards evolve, new fields appear, regulations tighten. Trails treats these changes as first-class operations: transform schemas declaratively, enrich data in batches, and validate everything against portable baselines.

This chapter walks through the full lifecycle: migrating data between schemas, filling gaps with enrichment functions, and locking down project configurations with baselines.


Learning objectives

After reading this chapter, you will be able to:

  • Use trails onto transform to plan and execute schema migrations
  • Declare placeholder fields for data that will be enriched later
  • Register enrichment functions with the @enrichment decorator
  • Run batch enrichment with trails enrich run and monitor fill rates
  • Define baselines that constrain project configuration
  • Scaffold projects from baselines and validate compliance
  • Understand how federated instances negotiate compatible baselines

Why schema transformation matters

Three scenarios make schema transformation a daily concern:

  1. Legacy migration. You inherit a database with field names like emp_nm, dept_cd, and sal_amt. The new schema uses name, department, and salary. Renaming a few fields is manageable; renaming hundreds across structural changes (flat records to nested nodes, coded values to enumerations) is not.

  2. Cross-domain alignment. A research team stores clinical data in a custom schema. A collaborator uses FHIR. Neither team will rewrite their application, but they need interoperable data. Schema transformation maps one representation to the other without changing either team's code.

  3. Version upgrades. Your own schema evolves. v1 stored author names as strings; v2 models authors as separate nodes with biography and affiliation. The existing data must migrate from v1 to v2 without downtime or data loss.

In all three cases, the pattern is the same: source schema in, target schema in, transformation plan out, reviewed execution.


trails onto transform — planning and executing transformations

The trails onto transform command compares two schemas and generates a transformation plan. It works with Python modules (@node_type declarations) or Turtle files (SHACL shapes) as inputs.

Generating a plan (dry-run)

By default, the command produces a plan without modifying data:

trails onto transform --source models/v1.py --target models/v2.py

Output:

Transformation plan: v1 → v2
  [auto]   Article.title       → Article.title        (same name, same type)
  [auto]   Article.body        → Article.body          (same name, same type)
  [rename] Article.author_name → Article.creator_name  (fuzzy match, 0.91)
  [coerce] Article.pub_date    → Article.published_at  (str → datetime)
  [split]  Article.full_name   → Author.first + Author.last (structural)
  [placeholder] Article.sentiment   — no source equivalent
  [placeholder] Article.quality_score — no source equivalent

2 auto-mappable, 1 rename, 1 coercion, 1 structural, 2 placeholders
Run with --execute to apply. Run with --show-code to see generated queries.

The engine uses three strategies in order:

  1. Exact match. Fields with identical names and compatible types are mapped automatically. No configuration, no cost.

  2. Fuzzy match. Fields with similar names (author_name and creator_name, pub_date and published_at) are matched using string similarity. A confidence score is reported. This is deterministic and free — no LLM needed.

  3. LLM-assisted match. When --llm-assist is passed, ambiguous mappings are proposed by a cheap LLM call, and the user confirms or overrides each proposal. These calls are cost-tracked through the standard cost envelope system.

The important principle: core transformation works without any LLM or API. Exact and fuzzy matching handle the majority of cases. LLM assistance is an optional accelerator for edge cases, never a requirement.
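
The first two strategies can be approximated with the standard library. The sketch below is not the Trails engine: propose_mappings, the 0.6 threshold, and the choice of difflib.SequenceMatcher as the similarity measure are all assumptions made for illustration, so its scores will not match the confidence values the CLI prints.

```python
from difflib import SequenceMatcher


def propose_mappings(source_fields, target_fields, threshold=0.6):
    """Map source fields to target fields: exact names first, then fuzzy.

    Hypothetical helper for illustration; not part of the Trails API.
    Returns (mappings, placeholders).
    """
    mappings = {}
    remaining = list(target_fields)

    # Strategy 1: exact name match.
    for src in source_fields:
        if src in remaining:
            mappings[src] = (src, "auto", 1.0)
            remaining.remove(src)

    # Strategy 2: fuzzy match against the best-scoring unclaimed target.
    for src in source_fields:
        if src in mappings or not remaining:
            continue
        scored = [(SequenceMatcher(None, src, tgt).ratio(), tgt) for tgt in remaining]
        score, best = max(scored)
        if score >= threshold:
            mappings[src] = (best, "rename", round(score, 2))
            remaining.remove(best)

    # Target fields nobody claimed have no source equivalent.
    return mappings, remaining


mappings, placeholders = propose_mappings(
    ["title", "body", "author_name"],
    ["title", "body", "creator_name", "sentiment"],
)
```

Both passes are deterministic, so the same pair of schemas always yields the same plan; only the optional LLM step would introduce variation.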

Inspecting generated code

trails onto transform --source models/v1.py --target models/v2.py --show-code

The engine generates SPARQL CONSTRUCT queries for simple operations (renames, type coercion, merges) and Python functions for complex ones (splits, conditional logic):

# Rename: author_name → creator_name
CONSTRUCT {
  ?s <https://app.example/creator_name> ?val .
}
WHERE {
  ?s a <https://app.example/Article> ;
     <https://app.example/author_name> ?val .
}

# Split: full_name → first_name + last_name
def split_full_name(node, ctx):
    parts = node.full_name.rsplit(" ", 1)
    return {
        "first_name": parts[0],
        "last_name": parts[1] if len(parts) > 1 else "",
    }

The generated code is visible, editable, and committable. No black box.
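
Because the generated functions are plain Python, they can be exercised before running --execute. The check below assumes the function only reads node.full_name, so a types.SimpleNamespace stands in for the ORM node; the stand-in objects and the None context are testing assumptions, not Trails objects.

```python
from types import SimpleNamespace


def split_full_name(node, ctx):
    """Generated split function, copied from the --show-code output above."""
    parts = node.full_name.rsplit(" ", 1)
    return {
        "first_name": parts[0],
        "last_name": parts[1] if len(parts) > 1 else "",
    }


# Stand-in nodes: only the attribute the function reads is provided.
two_part = split_full_name(SimpleNamespace(full_name="Alice Schmidt"), ctx=None)
three_part = split_full_name(SimpleNamespace(full_name="Mary Ann Weber"), ctx=None)
single = split_full_name(SimpleNamespace(full_name="Cher"), ctx=None)
```

Splitting on the last space keeps middle names in first_name, and a single-word name yields an empty last_name. If your data needs different handling, this is the place to edit before committing the generated code.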

Executing the transformation

trails onto transform --source models/v1.py --target models/v2.py --execute

This runs the generated queries and functions against the store. The command reports:

Transformed 1,247 nodes, 8,729 triples
  renamed:  1,247 author_name → creator_name
  coerced:  1,247 pub_date (str → datetime)
  split:    1,247 full_name → first_name + last_name
  placeholders: 2 fields marked on 1,247 nodes
Duration: 2.3s

Every transformation is recorded as a prov:Activity, so trails trace can answer "when was this field renamed and by what operation?"

The Python API

from trails.onto_transform import generate_plan, execute_plan

plan = generate_plan(
    source_schema="models/v1.py",
    target_schema="models/v2.py",
    ctx=ctx,
)

# Inspect the plan
for mapping in plan.field_mappings:
    print(f"{mapping.source} → {mapping.target} ({mapping.strategy})")

for ph in plan.placeholders:
    print(f"  [placeholder] {ph.field} — no source equivalent")

# Execute when satisfied
result = execute_plan(plan, ctx)
print(f"Transformed {result.node_count} nodes, {result.triple_count} triples")

Supported transformations

Category | Example | Strategy
Field rename | author_name → creator_name | SPARQL CONSTRUCT with predicate substitution
Type coercion | "42" (string) → 42 (integer) | SPARQL BIND with cast
Field split | full_name → first_name + last_name | Python function
Field merge | first_name + last_name → full_name | SPARQL CONCAT
Structural change | flat author string → nested Author node | SPARQL CONSTRUCT + INSERT
Enumeration mapping | "M"/"F" → "male"/"female" | SPARQL VALUES lookup table
Unmappable fields | target field has no source equivalent | Marked as placeholder

Placeholder fields — introducing fields for later enrichment

When a target schema has fields that don't exist in the source, those fields are unmappable. Rather than rejecting them or silently dropping them, Trails marks them as placeholders — typed fields that explicitly declare "this value will be filled later."

Declaring placeholders

from trails.orm import node_type, placeholder

@node_type("Article", fields={
    "title": str,
    "body": str,
    "sentiment": (str, placeholder),        # not yet filled
    "quality_score": (float, placeholder),   # not yet filled
    "summary": (str, placeholder),           # not yet filled
})
class Article:
    pass

The placeholder marker is metadata, not a type modifier. The field's actual type (str, float) is preserved and enforced when a value IS written. But writes that omit a placeholder field succeed without validation errors — placeholders are nullable by default.

How placeholders behave

  • SHACL export emits sh:minCount 0 for placeholder fields regardless of the type's normal cardinality rules. A str placeholder does not generate sh:minCount 1.
  • ORM queries return None for unfilled placeholders.
  • trails doctor reports placeholder fill rates:
Article:
  sentiment:      23% filled (115 / 500)
  quality_score:   0% filled (0 / 500)
  summary:        89% filled (445 / 500)
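
To make the "metadata, not a type modifier" rule concrete, here is a standalone sketch of the validation behavior described above. PLACEHOLDER, FIELDS, and validate are hypothetical stand-ins written for this illustration; they are not the trails.orm implementation.

```python
PLACEHOLDER = object()  # hypothetical marker, standing in for trails.orm.placeholder

# Field spec in the same shape as the @node_type example: a bare type,
# or a (type, marker) tuple for placeholder fields.
FIELDS = {
    "title": str,
    "sentiment": (str, PLACEHOLDER),
    "quality_score": (float, PLACEHOLDER),
}


def validate(values, fields=FIELDS):
    """Return a list of error strings; an empty list means the write is valid."""
    errors = []
    for name, spec in fields.items():
        is_placeholder = isinstance(spec, tuple) and PLACEHOLDER in spec
        expected = spec[0] if isinstance(spec, tuple) else spec
        value = values.get(name)
        if value is None:
            # Omitting a placeholder is fine; omitting a regular field is not.
            if not is_placeholder:
                errors.append(f"{name}: required")
        elif not isinstance(value, expected):
            # The declared type is still enforced once a value is written.
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


ok = validate({"title": "Hello"})
wrong_type = validate({"title": "Hello", "sentiment": 3})
missing_required = validate({"sentiment": "positive"})
```

The first write succeeds with both placeholders unfilled, the second is rejected because a written placeholder must still match its declared type, and the third fails only on the regular field.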

Integration with trails onto transform

When you pass --enrich to the transform command, target fields that have no source equivalent are automatically marked as placeholders in the generated code:

trails onto transform --source models/v1.py --target models/v2.py --enrich

This bridges transformation and enrichment: transform what you can, mark what you can't, enrich later.


The @enrichment decorator — registering enrichment functions

Enrichment functions fill placeholder fields. They are registered with the @enrichment decorator and executed in batch by trails enrich run.

Basic enrichment functions

from trails.enrichment import enrichment

@enrichment(target_type="Article", field="sentiment")
def analyse_sentiment(node, ctx):
    """Classify sentiment from the article body."""
    if not node.body:
        return None
    # Simple heuristic — no LLM needed
    positive_words = {"good", "great", "excellent", "innovative"}
    negative_words = {"bad", "poor", "terrible", "flawed"}
    words = set(node.body.lower().split())
    pos = len(words & positive_words)
    neg = len(words & negative_words)
    if pos > neg:
        return "positive"
    elif neg > pos:
        return "negative"
    return "neutral"

Enrichment functions receive the node (an ORM instance) and the Trails context. They return the value to write, or None to skip.
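
One plausible way to picture the decorator is a registry keyed by (type, field). The sketch below is an assumption about the mechanism, not the trails.enrichment source: REGISTRY, this local enrichment function, and the word_count field are hypothetical names used only for illustration.

```python
from types import SimpleNamespace

REGISTRY = {}  # hypothetical: (target_type, field) -> registration record


def enrichment(target_type, field, depends_on=None):
    """Toy stand-in for the Trails decorator: records the function and returns it."""
    def register(func):
        REGISTRY[(target_type, field)] = {
            "func": func,
            "depends_on": list(depends_on or []),
        }
        return func  # the function stays directly callable, e.g. in unit tests
    return register


@enrichment(target_type="Article", field="word_count")
def count_words(node, ctx):
    """Return the number of whitespace-separated words, or None to skip."""
    if not node.body:
        return None
    return len(node.body.split())


# A runner would look the function up by (type, field) and call it per node.
entry = REGISTRY[("Article", "word_count")]
filled = entry["func"](SimpleNamespace(body="Graphs change over time"), ctx=None)
skipped = entry["func"](SimpleNamespace(body=""), ctx=None)
```

Returning the undecorated function keeps enrichment logic testable with a simple stand-in node, which matters most for the deterministic, zero-cost functions.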

The key design principle: enrichment functions can use any logic. String manipulation, regex, math, external API lookups, database queries, or LLM calls. The framework does not prescribe the method. LLM is one option among many, not the default.

LLM-powered enrichment (optional)

When heuristics are not sufficient, enrichment functions can use ctx.llm:

@enrichment(target_type="Article", field="summary")
def summarise(node, ctx):
    """Generate a one-sentence summary using LLM."""
    if not node.body:
        return None
    response = ctx.llm.complete(
        f"Summarise this article in one sentence:\n\n{node.body[:2000]}",
        max_tokens=100,
    )
    return response.text.strip()

LLM calls are cost-tracked through the standard cost envelope system. Each trails enrich run invocation opens a cost envelope tagged enrich:<type>:<field>, and the CLI reports total cost at completion.

Derived enrichment (from existing data)

Enrichment functions can compute values from other fields on the same node or from other nodes entirely:

@enrichment(target_type="Article", field="quality_score")
def compute_quality(node, ctx):
    """Derive quality score from existing fields — zero cost."""
    score = 0.0
    if node.title and len(node.title) > 10:
        score += 0.3
    if node.body and len(node.body) > 200:
        score += 0.5
    if node.sentiment:  # filled by a prior enrichment
        score += 0.2
    return round(score, 2)

Enrichment dependencies

When one enrichment depends on another (e.g., quality_score needs sentiment to be filled first), declare the dependency:

@enrichment(target_type="Article", field="quality_score",
            depends_on=["sentiment"])
def compute_quality(node, ctx):
    ...

The scheduler resolves the dependency graph and executes enrichment functions in topological order. Fields without dependencies run first; dependent fields run after their prerequisites are filled.
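
Topological ordering of this kind can be reproduced with Python's standard graphlib module. This is a sketch of the idea rather than the Trails scheduler, whose internals are not shown here; the dependencies dictionary simply mirrors the depends_on declarations from the examples above.

```python
from graphlib import CycleError, TopologicalSorter

# field -> fields it depends on (the depends_on lists from the decorators)
dependencies = {
    "sentiment": [],
    "summary": [],
    "quality_score": ["sentiment"],
}

# static_order() yields every field after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())

# A circular declaration cannot be ordered and raises CycleError.
try:
    list(TopologicalSorter({"a": ["b"], "b": ["a"]}).static_order())
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

The relative order of independent fields (here sentiment and summary) is not significant; only the prerequisite-before-dependent guarantee is.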


trails enrich run and trails enrich status — batch enrichment

Running enrichments

# Run all registered enrichments on unfilled placeholders
trails enrich run

# Target a specific type
trails enrich run --type Article

# Target a specific field
trails enrich run --type Article --field sentiment

# Dry-run: show what would be enriched without executing
trails enrich run --dry-run

# Use the Batches API for LLM-powered enrichments (50% discount)
trails enrich run --batch

trails enrich run iterates over all instances of the target type, checks which placeholder fields are unfilled (None or absent), and calls the registered enrichment function for each.

Enrichment is idempotent — running it twice skips already-filled fields. Use --force to re-enrich previously filled values.
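
The skip-if-filled rule is what makes reruns safe. A minimal sketch of that loop, using plain dictionaries in place of ORM nodes (enrich_run and measure are hypothetical names, and the force parameter only mimics the --force flag):

```python
def enrich_run(nodes, field, func, force=False):
    """Fill `field` on each node dict; return how many values were written."""
    written = 0
    for node in nodes:
        if node.get(field) is not None and not force:
            continue  # already filled: skipping makes reruns idempotent
        value = func(node)
        if value is None:
            continue  # the enrichment function chose to skip this node
        node[field] = value
        written += 1
    return written


def measure(node):
    """Toy enrichment: body length, or None when the body is empty."""
    return len(node["body"]) or None


nodes = [{"body": "a great read"}, {"body": ""}, {"body": "short", "length": 99}]

first = enrich_run(nodes, "length", measure)                 # fills only the first node
second = enrich_run(nodes, "length", measure)                # nothing left to fill
forced = enrich_run(nodes, "length", measure, force=True)    # overwrites filled values
```

Note that a node whose function returns None stays unfilled and is retried on every run, which is why fill rates can sit below 100% even after repeated invocations.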

Monitoring fill rates

trails enrich status

Output:

Article (500 instances):
  title:          100% filled (500 / 500) — not a placeholder
  body:           100% filled (500 / 500) — not a placeholder
  sentiment:       23% filled (115 / 500) — enrichment registered
  quality_score:    0% filled (0 / 500)   — enrichment registered
  summary:         89% filled (445 / 500) — enrichment registered

Registered enrichments:
  Article.sentiment      → analyse_sentiment (heuristic)
  Article.quality_score  → compute_quality (derived, depends_on: [sentiment])
  Article.summary        → summarise (llm)

Provenance

Every enriched value emits a prov:Activity linking the enrichment function (as prov:Agent) to the filled value (as prov:Entity). The activity records the function name, model used (if any), cost, and timestamp. trails trace can answer "who filled this field and when?"


Agent-driven enrichment

For fully autonomous enrichment without pre-registered functions, trails enrich agent spawns an agent that discovers unfilled placeholders and attempts to fill them.

How the agent works

The agent follows the ReAct (Reason + Act) loop:

  1. Observe. Query the store for all placeholder fields with fill rates below 100%. Rank fields by estimated difficulty: type coercion (easy) → derivation from existing data (medium) → LLM inference (hard).

  2. Plan. For each unfilled field, decide a strategy:

     • Derive: compute from other fields on the same node
     • Lookup: query external APIs or reference data
     • Infer: use LLM to generate a plausible value from context
     • Skip: mark as "cannot fill" with a reason

  3. Act. Execute the strategy, fill the value, record provenance.

  4. Verify. Check the filled value against the field's type constraint. Invalid values are rolled back with a warning.

CLI usage

# Spawn enrichment agent
trails enrich agent

# With budget constraint (stops when budget is exhausted)
trails enrich agent --budget 5.00

# Target specific types
trails enrich agent --type Article --type Author

# Progressive mode: cheap operations first, then expensive ones
trails enrich agent --progressive

The --progressive flag maximises fill rate per dollar: string manipulation and type coercion execute before LLM calls. The --budget flag sets a hard cost ceiling — when the envelope is exhausted, the agent stops, reports what was filled, and lists what remains.
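
The interaction between the two flags can be illustrated with a small scheduling sketch. The task names, cost estimates, and run_with_budget function are invented for this example; the real agent's cost model is not described in this chapter.

```python
def run_with_budget(tasks, budget, progressive=True):
    """tasks: list of (name, estimated_cost_usd).

    Returns (done, remaining, spent). Stops at the first task that would
    push spending past the budget.
    """
    # Progressive mode: cheapest operations first (the sort is stable).
    queue = sorted(tasks, key=lambda t: t[1]) if progressive else list(tasks)
    done, remaining, spent = [], [], 0.0
    for i, (name, cost) in enumerate(queue):
        if spent + cost > budget:
            remaining = [n for n, _ in queue[i:]]  # report what was not attempted
            break
        spent += cost
        done.append(name)
    return done, remaining, spent


tasks = [
    ("infer_summary", 3.0),   # LLM call
    ("coerce_dates", 0.0),    # type coercion
    ("infer_topic", 2.5),     # LLM call
    ("derive_score", 0.0),    # derived from existing fields
]

progressive = run_with_budget(tasks, budget=3.0, progressive=True)
in_order = run_with_budget(tasks, budget=3.0, progressive=False)
```

With the same 3.00 budget, the progressive run completes three tasks while the in-order run completes two, because the expensive first task consumes the whole envelope.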

When to use agent enrichment vs. registered functions

Scenario | Use
Known logic, repeated runs | @enrichment + trails enrich run
Exploratory, one-off, unknown logic | trails enrich agent
Cost-sensitive production | @enrichment (predictable cost)
Rapid prototyping | trails enrich agent --budget 2.00

Agent-driven enrichment is powerful but non-deterministic. Two runs may produce different values. For production pipelines, registered enrichment functions give you repeatability and testability. The agent is best for exploration, bootstrapping, and filling gaps that don't justify a dedicated function.


Baselines — portable configuration contracts

A baseline declares what a Trails project MUST look like to be compliant with a named profile. It is not the configuration itself — it is constraints ON the configuration.

Think of baselines as .eslintrc for your knowledge-graph app: trails.toml says what the app IS; the baseline says what it SHOULD BE.

The baseline format

Baselines are TOML files with a [baseline] header and typed sections:

[baseline]
name = "healthcare-fhir"
version = "1.0"
extends = "trails:default"

[baseline.store]
backend = "oxigraph"
reasoning = "rdfs"
max_query_time_ms = 30000

[baseline.schema]
upper_ontology = "schema.org"
core_types = ["Patient", "Encounter", "Observation"]
alignment = "fhir-r4"
shacl_strictness = "closed"

[baseline.pipeline]
default_extractors = ["pdf", "html", "docx"]
enrichment_steps = ["ner", "linking", "dedup"]
validation = "strict"

[baseline.policy]
template = "hipaa"
audit_level = "full"

[baseline.agents]
default_model = "haiku"
max_cost_per_run_usd = 1.0
planner = "react"
budget_enforcement = true

[baseline.federation]
require_policy_alignment = true
minimum_peer_trust_level = "verified"

Sections are optional. A baseline that only declares [baseline.policy] constrains only the policy dimension — everything else is unconstrained.

Built-in baselines

Trails ships four built-in presets:

Baseline | Description
trails:default | Minimal, no constraints, open-world. SHACL open, no policy required, no reasoning. The zero-overhead starting point.
trails:strict | SHACL closed-world validation, full PROV-O provenance, Cedar policy required on all capabilities, cost envelopes enforced.
trails:research | RDF-star enabled, reasoning active (RDFS), large query timeouts (120s), open SHACL, relaxed cost limits.
trails:compliance | Full audit trail, consent receipts required, PROV-O on every write, cost tracking mandatory, budget enforcement on.

Custom baselines

Create your own by writing a TOML file in baselines/ or ~/.trails/baselines/:

[baseline]
name = "my-team-standard"
version = "1.0"
extends = "trails:strict"

[baseline.schema]
core_types = ["Project", "Task", "Person"]
shacl_strictness = "closed"

[baseline.agents]
default_model = "haiku"
max_cost_per_run_usd = 0.50

Baseline inheritance

The extends field references a parent baseline:

  • extends = "trails:default" — built-in preset
  • extends = "trails:strict" — built-in strict preset
  • extends = "file:../base.toml" — relative file path
  • extends = ["trails:compliance", "file:../domain.toml"] — multiple parents (last wins on conflict, with a warning)

Every baseline without an explicit extends inherits from trails:default. Child values override parent values at the leaf level.
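
Leaf-level override amounts to a recursive dictionary merge. The sketch below assumes baselines have already been parsed into nested dictionaries; merge_baselines is a hypothetical helper, and the parent values are illustrative only, not the actual contents of trails:strict.

```python
def merge_baselines(parent, child):
    """Recursively merge two baseline dicts; child values win at the leaf level."""
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_baselines(merged[key], value)  # descend into section
        else:
            merged[key] = value  # leaf (or new section): child overrides
    return merged


# Illustrative parent values only; not the real trails:strict preset.
strict_like = {
    "schema": {"shacl_strictness": "closed", "core_types": []},
    "agents": {"budget_enforcement": True, "max_cost_per_run_usd": 1.0},
}
team = {
    "schema": {"core_types": ["Project", "Task", "Person"]},
    "agents": {"max_cost_per_run_usd": 0.5},
}

effective = merge_baselines(strict_like, team)
```

Settings the child does not mention (here shacl_strictness and budget_enforcement) survive from the parent. A list of parents could be folded left to right with functools.reduce, which gives the "last wins" behavior described above.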


trails new --baseline and baseline CLI commands

Scaffolding a project from a baseline

trails new myapp --baseline healthcare-fhir

This scaffolds a new project where trails.toml is generated to satisfy the baseline constraints, and baselines/active.toml records the active baseline.

Validating against a baseline

trails baseline validate

Checks the current trails.toml against the active baseline. Reports violations as structured diagnostics (same format as trails doctor):

baseline: schema — FAIL
  core_types: missing "Observation" (declared in baseline, not registered)
baseline: policy — PASS
  template: "hipaa" loaded
baseline: agents — WARN
  max_cost_per_run_usd: configured as 2.0, baseline requires <= 1.0
baseline: store — PASS
  backend: oxigraph, reasoning: rdfs

2 passed, 1 warning, 1 failure

trails doctor also runs baseline checks automatically when a baseline is active.
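
The two kinds of check in the sample output (set membership for core_types, an upper bound for max_cost_per_run_usd) can be sketched as follows. validate_config is a hypothetical function, and the FAIL/WARN severities simply mirror the sample report; how Trails actually assigns severities is not specified here.

```python
def validate_config(config, baseline):
    """Compare a trails.toml-like dict with baseline constraints.

    Returns {section: (status, messages)}.
    """
    report = {}

    # Every core type the baseline declares must be registered.
    required = set(baseline.get("schema", {}).get("core_types", []))
    registered = set(config.get("schema", {}).get("core_types", []))
    missing = sorted(required - registered)
    if missing:
        report["schema"] = ("FAIL", [f'core_types: missing "{m}"' for m in missing])
    else:
        report["schema"] = ("PASS", [])

    # The configured cost ceiling must not exceed the baseline's.
    limit = baseline.get("agents", {}).get("max_cost_per_run_usd")
    actual = config.get("agents", {}).get("max_cost_per_run_usd")
    if limit is not None and actual is not None and actual > limit:
        message = f"max_cost_per_run_usd: configured as {actual}, baseline requires <= {limit}"
        report["agents"] = ("WARN", [message])
    else:
        report["agents"] = ("PASS", [])
    return report


baseline = {
    "schema": {"core_types": ["Patient", "Encounter", "Observation"]},
    "agents": {"max_cost_per_run_usd": 1.0},
}
config = {
    "schema": {"core_types": ["Patient", "Encounter"]},
    "agents": {"max_cost_per_run_usd": 2.0},
}
report = validate_config(config, baseline)
```

Sections the baseline omits produce no constraint, which matches the rule that an undeclared section leaves that dimension unconstrained.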

Exporting a baseline from an existing project

trails baseline export -o baselines/extracted.toml

Extracts the current trails.toml configuration as a baseline file. Useful for capturing a working project's configuration as a reusable template.

Comparing baselines

trails baseline diff base1.toml base2.toml

Shows differences section by section:

[baseline.schema]
  shacl_strictness: "open" (base1) → "closed" (base2)
  core_types: base1 has ["Article"], base2 has ["Article", "Author"]

[baseline.agents]
  max_cost_per_run_usd: 5.0 (base1) → 1.0 (base2)

[baseline.policy]
  — only in base2: template = "hipaa"
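
A section-by-section comparison like this is straightforward once both files are parsed into dictionaries. The sketch below uses a hypothetical diff_baselines helper and represents a key that is absent on one side as None, which is a simplification.

```python
def diff_baselines(base1, base2):
    """Return (section, key, value_in_base1, value_in_base2) for each difference."""
    changes = []
    for section in sorted(set(base1) | set(base2)):
        s1, s2 = base1.get(section, {}), base2.get(section, {})
        for key in sorted(set(s1) | set(s2)):
            v1, v2 = s1.get(key), s2.get(key)  # None means "not declared"
            if v1 != v2:
                changes.append((section, key, v1, v2))
    return changes


base1 = {
    "schema": {"shacl_strictness": "open", "core_types": ["Article"]},
    "agents": {"max_cost_per_run_usd": 5.0},
}
base2 = {
    "schema": {"shacl_strictness": "closed", "core_types": ["Article", "Author"]},
    "agents": {"max_cost_per_run_usd": 1.0},
    "policy": {"template": "hipaa"},
}
changes = diff_baselines(base1, base2)
```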

Federation baseline negotiation (conceptual)

When two Trails instances federate, baselines enable automatic compatibility checks. This is a conceptual overview — federation negotiation is designed for post-M12 implementation.

The negotiation flow

  1. Exchange. On federation handshake, each instance sends its active baseline (minus sensitive fields like API keys) to the peer.

  2. Compatibility check. The framework computes whether the two baselines are compatible: same or compatible SHACL strictness, policy templates that satisfy each other's requirements, compatible reasoning modes (RDFS subsumes no-reasoning; OWL-RL subsumes RDFS).

  3. Negotiation. If baselines differ but overlap, the framework computes a "negotiated baseline" — the intersection of compatible constraints. The stricter option wins for safety: if one peer requires shacl_strictness = "closed" and the other is "open", the negotiated baseline uses "closed".

  4. Recording. The negotiation result is recorded as a prov:Activity in both instances' provenance graphs.

  5. Rejection. If baselines are incompatible (e.g., one requires HIPAA policy and the other has no policy engine), federation is refused with a diagnostic explaining the incompatibility.
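
Since negotiation is not implemented yet, the following is only a thought experiment covering steps 2, 3, and 5 for two dimensions. The dictionary keys (required_policy, supported_policies), the STRICTNESS ordering list, and the negotiate function are all assumptions made for this sketch.

```python
STRICTNESS = ["open", "closed"]  # later entries are stricter


def negotiate(a, b):
    """Return the negotiated constraints, or raise ValueError if incompatible."""
    # Compatibility: each peer must support whatever policy the other requires.
    for needs, offers in ((a, b), (b, a)):
        required = needs.get("required_policy")
        if required and required not in offers.get("supported_policies", ()):
            raise ValueError(f"peer does not support required policy {required!r}")

    # Negotiation: the stricter SHACL mode wins.
    return {
        "shacl_strictness": max(
            a["shacl_strictness"], b["shacl_strictness"], key=STRICTNESS.index
        ),
    }


clinic = {"shacl_strictness": "closed", "required_policy": "hipaa",
          "supported_policies": {"hipaa"}}
lab = {"shacl_strictness": "open", "required_policy": None,
       "supported_policies": {"hipaa"}}
no_policy_engine = {"shacl_strictness": "open", "required_policy": None,
                    "supported_policies": set()}

negotiated = negotiate(clinic, lab)

try:
    negotiate(clinic, no_policy_engine)
    refused = False
except ValueError:
    refused = True
```

The first pair federates under closed-world validation; the second is refused because one side requires HIPAA and the other cannot provide it, mirroring the rejection case above.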

CLI preview

trails baseline negotiate peer_url

Exchanges baselines with a peer and outputs the negotiated baseline. This command will be implemented when federation reaches Level 3 (ADR-0023).


Full walkthrough: legacy data → new schema → enrich → validate

Here is the complete journey, end to end. We will migrate a legacy employee database to a modern schema, enrich it with derived and computed fields, and validate the result against a baseline.

Step 1: Start with legacy data

Assume a graph populated from a legacy system with flat, abbreviated field names:

from trails import capability

@capability
def import_legacy(ctx) -> dict:
    """Load legacy employee data."""
    records = [
        {"emp_nm": "Alice Schmidt", "dept_cd": "ENG", "sal_amt": 95000},
        {"emp_nm": "Bob Müller", "dept_cd": "MKT", "sal_amt": 82000},
        {"emp_nm": "Carol Weber", "dept_cd": "ENG", "sal_amt": 91000},
    ]
    for rec in records:
        ctx.kg.node(labels=["Employee"], properties=rec)
    return {"imported": len(records)}

Step 2: Define the target schema

The modern schema uses readable field names, splits the full name, and adds fields for enrichment:

from trails.orm import node_type, placeholder

@node_type("Employee", fields={
    "first_name": str,
    "last_name": str,
    "department": str,
    "salary": int,
    "seniority_level": (str, placeholder),   # to be enriched
    "department_full": (str, placeholder),    # to be enriched
})
class Employee: ...

Step 3: Plan the transformation

trails onto transform \
    --source models/legacy.py \
    --target models/v2.py \
    --enrich

Output:

Transformation plan: legacy → v2
  [split]       emp_nm   → first_name + last_name (structural)
  [rename]      dept_cd  → department (fuzzy match, 0.87)
  [rename]      sal_amt  → salary (fuzzy match, 0.83)
  [placeholder] seniority_level — no source equivalent
  [placeholder] department_full — no source equivalent

3 mappable, 2 placeholders

Review the plan. Adjust if needed. Then execute:

trails onto transform \
    --source models/legacy.py \
    --target models/v2.py \
    --enrich \
    --execute

Step 4: Register enrichment functions

from trails.enrichment import enrichment

@enrichment(target_type="Employee", field="department_full")
def expand_department(node, ctx):
    """Expand department codes to full names — pure lookup, zero cost."""
    codes = {"ENG": "Engineering", "MKT": "Marketing", "FIN": "Finance"}
    return codes.get(node.department, node.department)

@enrichment(target_type="Employee", field="seniority_level",
            depends_on=["department_full"])
def estimate_seniority(node, ctx):
    """Derive seniority from salary bands — no LLM needed."""
    # Note: this function reads only salary; depends_on just forces it
    # to run after expand_department.
    if node.salary >= 100000:
        return "senior"
    elif node.salary >= 85000:
        return "mid"
    return "junior"

Both functions are deterministic and free. No LLM calls, no API costs.
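
Since neither function touches the context, both can be checked against the sample records before running the batch. The decorators are omitted below so the snippet runs without Trails, and types.SimpleNamespace objects stand in for the migrated Employee nodes; both choices are testing assumptions.

```python
from types import SimpleNamespace


def expand_department(node, ctx):
    """Expand department codes to full names; unknown codes pass through."""
    codes = {"ENG": "Engineering", "MKT": "Marketing", "FIN": "Finance"}
    return codes.get(node.department, node.department)


def estimate_seniority(node, ctx):
    """Derive seniority from salary bands."""
    if node.salary >= 100000:
        return "senior"
    elif node.salary >= 85000:
        return "mid"
    return "junior"


# Stand-ins for the three migrated records from Step 1.
employees = [
    SimpleNamespace(first_name="Alice", department="ENG", salary=95000),
    SimpleNamespace(first_name="Bob", department="MKT", salary=82000),
    SimpleNamespace(first_name="Carol", department="ENG", salary=91000),
]

results = {
    e.first_name: (expand_department(e, None), estimate_seniority(e, None))
    for e in employees
}
```

None of the three sample salaries reaches the 100000 band, so a realistic test suite would add a boundary case for "senior" as well as an unknown department code.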

Step 5: Run enrichment

trails enrich run --type Employee

Output:

Enriching Employee (3 instances):
  department_full:  3 / 3 filled (expand_department)
  seniority_level:  3 / 3 filled (estimate_seniority)
Total cost: $0.00
Duration: 0.1s

Check the result:

trails enrich status --type Employee

Output:

Employee (3 instances):
  first_name:       100% filled (3 / 3)
  last_name:        100% filled (3 / 3)
  department:       100% filled (3 / 3)
  salary:           100% filled (3 / 3)
  seniority_level:  100% filled (3 / 3) — enriched
  department_full:  100% filled (3 / 3) — enriched

Step 6: Define and validate a baseline

Create baselines/active.toml:

[baseline]
name = "employee-app"
version = "1.0"
extends = "trails:strict"

[baseline.schema]
core_types = ["Employee"]
shacl_strictness = "closed"

[baseline.agents]
default_model = "haiku"
max_cost_per_run_usd = 1.0
budget_enforcement = true

Validate:

trails baseline validate

Output:

baseline: schema — PASS
  core_types: "Employee" registered
  shacl_strictness: closed
baseline: agents — PASS
  budget_enforcement: on
  max_cost_per_run_usd: 1.0

2 passed, 0 warnings, 0 failures

Step 7: Verify provenance

Every step left a trail. Query it:

@capability
def transformation_history(ctx) -> list:
    """Show what happened to our data."""
    results = ctx.kg.sparql("""
        SELECT ?entity ?activity ?kind ?timestamp
        WHERE {
            ?entity prov:wasGeneratedBy ?activity .
            ?activity trails:activityKind ?kind ;
                      prov:endedAtTime ?timestamp .
        }
        ORDER BY ?timestamp
        LIMIT 20
    """)
    return [dict(r) for r in results]

The provenance graph records the full chain: ingestion → transformation → enrichment, with timestamps, function names, and costs at every step.


The progressive story

Schema transformation, enrichment, and baselines follow the same progressive-enhancement principle as the rest of Trails:

Level | What you use | What you get
0 | Manual SPARQL CONSTRUCT | Hand-written transformation queries
1 | trails onto transform | Auto-generated field mappings, dry-run
2 | Placeholder fields + @enrichment | Typed gaps + registered fillers
3 | trails enrich run | Batch enrichment with provenance
4 | trails enrich agent | Autonomous gap-filling with budget control
Any | trails baseline validate | Configuration compliance checking

Each level is additive. A project at Level 1 never sees enrichment. A project at Level 3 may never need Level 4. Manual SPARQL CONSTRUCT continues to work at every level.

Baselines are orthogonal — they validate configuration at any level, from a bare trails.toml to a fully enriched, federated deployment.


See also

  • Auto-Ontology — schema inference and generation, the precursor to transformation
  • Data Integration — RML mappings for ingesting external data before transformation
  • Trust and Policy — Cedar policies and provenance, which baselines can constrain
  • Federation and Scale — multi-instance federation, where baseline negotiation applies
  • ADR-0026 — the transformation and enrichment architecture decision
  • ADR-0027 — the baselines architecture decision