Chapter 11 — Schema Transformation, Enrichment, and Baselines¶
Your knowledge graph is never static. Legacy systems retire, standards evolve, new fields appear, regulations tighten. Trails treats these changes as first-class operations: transform schemas declaratively, enrich data in batches, and validate everything against portable baselines.
This chapter walks through the full lifecycle: migrating data between schemas, filling gaps with enrichment functions, and locking down project configurations with baselines.
Learning objectives¶
After reading this chapter, you will be able to:
- Use trails onto transform to plan and execute schema migrations
- Declare placeholder fields for data that will be enriched later
- Register enrichment functions with the @enrichment decorator
- Run batch enrichment with trails enrich run and monitor fill rates
- Define baselines that constrain project configuration
- Scaffold projects from baselines and validate compliance
- Understand how federated instances negotiate compatible baselines
Why schema transformation matters¶
Three scenarios make schema transformation a daily concern:
- Legacy migration. You inherit a database with field names like emp_nm, dept_cd, and sal_amt. The new schema uses name, department, and salary. Renaming a few fields is manageable; renaming hundreds across structural changes (flat records to nested nodes, coded values to enumerations) is not.
- Cross-domain alignment. A research team stores clinical data in a custom schema. A collaborator uses FHIR. Neither team will rewrite their application, but they need interoperable data. Schema transformation maps one representation to the other without changing either team's code.
- Version upgrades. Your own schema evolves. v1 stored author names as strings; v2 models authors as separate nodes with biography and affiliation. The existing data must migrate from v1 to v2 without downtime or data loss.
In all three cases, the pattern is the same: source schema in, target schema in, transformation plan out, reviewed execution.
trails onto transform — planning and executing transformations¶
The trails onto transform command compares two schemas and generates a
transformation plan. It works with Python modules (@node_type
declarations) or Turtle files (SHACL shapes) as inputs.
Generating a plan (dry-run)¶
By default, the command produces a plan without modifying data:
Output:
Transformation plan: v1 → v2
[auto] Article.title → Article.title (same name, same type)
[auto] Article.body → Article.body (same name, same type)
[rename] Article.author_name → Article.creator_name (fuzzy match, 0.91)
[coerce] Article.pub_date → Article.published_at (str → datetime)
[split] Article.full_name → Author.first + Author.last (structural)
[placeholder] Article.sentiment — no source equivalent
[placeholder] Article.quality_score — no source equivalent
2 auto-mappable, 1 rename, 1 coercion, 1 structural, 2 placeholders
Run with --execute to apply. Run with --show-code to see generated queries.
The engine uses three strategies in order:
- Exact match. Fields with identical names and compatible types are mapped automatically. No configuration, no cost.
- Fuzzy match. Fields with similar names (author_name and creator_name, pub_date and published_at) are matched using string similarity. A confidence score is reported. This is deterministic and free; no LLM is needed.
- LLM-assisted match. When --llm-assist is passed, ambiguous mappings are proposed by a cheap LLM call. The user confirms or overrides. Cost-tracked through the standard cost envelope system.
The important principle: core transformation works without any LLM or API. Exact and fuzzy matching handle the majority of cases. LLM assistance is an optional accelerator for edge cases, never a requirement.
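The first two strategies can be approximated in a few lines of standard-library Python. This is a minimal sketch, not the Trails engine: the similarity metric Trails uses is not stated here, so difflib.SequenceMatcher stands in for it, and match_fields and the 0.6 threshold are hypothetical. Scores from this sketch will differ from the ones in the plan output above.

```python
from difflib import SequenceMatcher


def match_fields(source_fields, target_fields, threshold=0.6):
    """Map each target field to (strategy, source) using exact, then fuzzy matching."""
    plan = {}
    remaining = list(source_fields)

    # Pass 1: identical names are claimed first.
    for target in target_fields:
        if target in remaining:
            plan[target] = ("exact", target)
            remaining.remove(target)

    # Pass 2: best string-similarity match among the still-unclaimed sources.
    for target in target_fields:
        if target in plan:
            continue
        scored = [(SequenceMatcher(None, src, target).ratio(), src) for src in remaining]
        best_score, best_src = max(scored, default=(0.0, None))
        if best_src is not None and best_score >= threshold:
            plan[target] = ("fuzzy", best_src)
            remaining.remove(best_src)
        else:
            # Nothing similar enough: the field becomes a placeholder.
            plan[target] = ("placeholder", None)
    return plan


plan = match_fields(
    source_fields=["title", "body", "author_name"],
    target_fields=["title", "body", "creator_name", "sentiment"],
)
for target, (strategy, source) in plan.items():
    print(f"[{strategy}] {source} -> {target}")
```

Running exact matching before fuzzy matching keeps an identically named field from being claimed by a merely similar one.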
Inspecting generated code¶
The engine generates SPARQL CONSTRUCT queries for simple operations (renames, type coercion, merges) and Python functions for complex ones (splits, conditional logic):
# Rename: author_name → creator_name
CONSTRUCT {
  ?s <https://app.example/creator_name> ?val .
}
WHERE {
  ?s a <https://app.example/Article> ;
     <https://app.example/author_name> ?val .
}
# Split: full_name → first_name + last_name
def split_full_name(node, ctx):
    parts = node.full_name.rsplit(" ", 1)
    return {
        "first_name": parts[0],
        "last_name": parts[1] if len(parts) > 1 else "",
    }
The generated code is visible, editable, and committable. No black box.
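Because the generated functions are plain Python, they can be exercised before the plan is executed. The sketch below repeats the split function and calls it on stand-in objects; SimpleNamespace is only a substitute for an ORM node, and the sample names are illustrative.

```python
from types import SimpleNamespace


def split_full_name(node, ctx):
    # Same logic as the generated function: split on the last space.
    parts = node.full_name.rsplit(" ", 1)
    return {
        "first_name": parts[0],
        "last_name": parts[1] if len(parts) > 1 else "",
    }


# ctx is unused by this function, so None is passed in its place.
two_part = split_full_name(SimpleNamespace(full_name="Alice Schmidt"), ctx=None)
three_part = split_full_name(SimpleNamespace(full_name="Jean Luc Picard"), ctx=None)
single = split_full_name(SimpleNamespace(full_name="Cher"), ctx=None)

print(two_part)
print(three_part)
print(single)
```

Splitting on the last space puts every middle name into first_name and leaves last_name empty for single-word names. Whether that suits the data is exactly the kind of decision worth reviewing before running with --execute.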
Executing the transformation¶
This runs the generated queries and functions against the store. The command reports:
Transformed 1,247 nodes, 8,729 triples
renamed: 1,247 author_name → creator_name
coerced: 1,247 pub_date (str → datetime)
split: 1,247 full_name → first_name + last_name
placeholders: 2 fields marked on 1,247 nodes
Duration: 2.3s
Every transformation is recorded as a prov:Activity — trails trace
can answer "when was this field renamed and by what operation?"
The Python API¶
from trails.onto_transform import generate_plan, execute_plan

plan = generate_plan(
    source_schema="models/v1.py",
    target_schema="models/v2.py",
    ctx=ctx,
)

# Inspect the plan
for mapping in plan.field_mappings:
    print(f"{mapping.source} → {mapping.target} ({mapping.strategy})")
for ph in plan.placeholders:
    print(f"  [placeholder] {ph.field} — no source equivalent")

# Execute when satisfied
result = execute_plan(plan, ctx)
print(f"Transformed {result.node_count} nodes, {result.triple_count} triples")
Supported transformations¶
| Category | Example | Strategy |
|---|---|---|
| Field rename | author_name → creator_name | SPARQL CONSTRUCT with predicate substitution |
| Type coercion | "42" (string) → 42 (integer) | SPARQL BIND with cast |
| Field split | full_name → first_name + last_name | Python function |
| Field merge | first_name + last_name → full_name | SPARQL CONCAT |
| Structural change | flat author string → nested Author node | SPARQL CONSTRUCT + INSERT |
| Enumeration mapping | "M"/"F" → "male"/"female" | SPARQL VALUES lookup table |
| Unmappable fields | target field has no source equivalent | Marked as placeholder |
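To make the coercion and enumeration rows concrete, here is a plain-Python equivalent of what those transformations compute. Trails performs them in SPARQL according to the table; the helper names and the lookup table below are hypothetical and only illustrate the semantics, including one possible way to handle values that cannot be converted.

```python
from datetime import datetime

GENDER_CODES = {"M": "male", "F": "female"}  # illustrative lookup table


def coerce_int(value):
    """Cast "42" to 42; return None when the value is not an integer string."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def coerce_datetime(value):
    """Parse an ISO 8601 string such as "2024-03-01" into a datetime."""
    try:
        return datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def map_enum(value, table):
    """Translate a coded value; leave unknown codes unchanged."""
    return table.get(value, value)


print(coerce_int("42"))
print(coerce_datetime("2024-03-01"))
print(map_enum("M", GENDER_CODES))
```

Returning None for unparseable input is an assumption of this sketch; a real migration might instead fail loudly or record the offending value.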
Placeholder fields — introducing fields for later enrichment¶
When a target schema has fields that don't exist in the source, those fields are unmappable. Rather than rejecting them or silently dropping them, Trails marks them as placeholders — typed fields that explicitly declare "this value will be filled later."
Declaring placeholders¶
from trails.orm import node_type, placeholder

@node_type("Article", fields={
    "title": str,
    "body": str,
    "sentiment": (str, placeholder),       # not yet filled
    "quality_score": (float, placeholder), # not yet filled
    "summary": (str, placeholder),         # not yet filled
})
class Article:
    pass
The placeholder marker is metadata, not a type modifier. The field's
actual type (str, float) is preserved and enforced when a value IS
written. But writes that omit a placeholder field succeed without
validation errors — placeholders are nullable by default.
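The rule "typed when written, optional when omitted" can be sketched without Trails. In the example below, PLACEHOLDER and validate_write are hypothetical stand-ins for the real marker and the ORM's validation, and treating non-placeholder fields as required is an assumption.

```python
PLACEHOLDER = object()  # hypothetical stand-in for trails.orm.placeholder

FIELDS = {
    "title": str,
    "sentiment": (str, PLACEHOLDER),
    "quality_score": (float, PLACEHOLDER),
}


def validate_write(fields, values):
    """Return a list of error strings; an empty list means the write is valid."""
    errors = []
    for name, spec in fields.items():
        is_placeholder = isinstance(spec, tuple) and spec[1] is PLACEHOLDER
        expected = spec[0] if isinstance(spec, tuple) else spec
        value = values.get(name)
        if value is None:
            # Omitted placeholders are fine; omitted regular fields are not.
            if not is_placeholder:
                errors.append(f"{name}: required")
            continue
        # The declared type is still enforced once a value is written.
        if not isinstance(value, expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


print(validate_write(FIELDS, {"title": "Hello"}))
print(validate_write(FIELDS, {"title": "Hello", "sentiment": 3}))
```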
How placeholders behave¶
- SHACL export emits sh:minCount 0 for placeholder fields regardless of the type's normal cardinality rules. A str placeholder does not generate sh:minCount 1.
- ORM queries return None for unfilled placeholders.
- trails doctor reports placeholder fill rates:
Article:
sentiment: 23% filled (115 / 500)
quality_score: 0% filled (0 / 500)
summary: 89% filled (445 / 500)
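A fill rate is the share of instances whose field is not None. The following sketch reproduces the numbers in the report above from synthetic data; fill_rates is a hypothetical helper, and plain dicts stand in for ORM instances.

```python
def fill_rates(nodes, fields):
    """Return {field: (filled, total, percent)} for a list of dict nodes."""
    total = len(nodes)
    report = {}
    for field in fields:
        filled = sum(1 for node in nodes if node.get(field) is not None)
        percent = round(100 * filled / total) if total else 0
        report[field] = (filled, total, percent)
    return report


# Synthetic data shaped to match the example report: 500 articles in total.
nodes = (
    [{"sentiment": "positive", "summary": "ok"}] * 115
    + [{"sentiment": None, "summary": "ok"}] * 330
    + [{"summary": None}] * 55
)
report = fill_rates(nodes, ["sentiment", "quality_score", "summary"])
for field, (filled, total, percent) in report.items():
    print(f"{field}: {percent}% filled ({filled} / {total})")
```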
Integration with trails onto transform¶
When you pass --enrich to the transform command, target fields that
have no source equivalent are automatically marked as placeholders in
the generated code:
This bridges transformation and enrichment: transform what you can, mark what you can't, enrich later.
The @enrichment decorator — registering enrichment functions¶
Enrichment functions fill placeholder fields. They are registered with
the @enrichment decorator and executed in batch by
trails enrich run.
Basic enrichment functions¶
from trails.enrichment import enrichment

@enrichment(target_type="Article", field="sentiment")
def analyse_sentiment(node, ctx):
    """Classify sentiment from the article body."""
    if not node.body:
        return None
    # Simple heuristic — no LLM needed
    positive_words = {"good", "great", "excellent", "innovative"}
    negative_words = {"bad", "poor", "terrible", "flawed"}
    words = set(node.body.lower().split())
    pos = len(words & positive_words)
    neg = len(words & negative_words)
    if pos > neg:
        return "positive"
    elif neg > pos:
        return "negative"
    return "neutral"
Enrichment functions receive the node (an ORM instance) and the Trails
context. They return the value to write, or None to skip.
The key design principle: enrichment functions can use any logic, including string manipulation, regex, math, external API lookups, database queries, or LLM calls. The framework does not prescribe the method. An LLM is one option among many, not the default.
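For readers curious how a registering decorator of this shape can work, here is a minimal sketch. It is not the Trails implementation: REGISTRY and this local enrichment function are hypothetical stand-ins, word_count is an invented field, and dicts replace ORM instances.

```python
REGISTRY = {}  # (target_type, field) -> registered function and metadata


def enrichment(target_type, field, depends_on=None):
    """Hypothetical stand-in for trails.enrichment.enrichment."""
    def decorator(func):
        REGISTRY[(target_type, field)] = {
            "func": func,
            "depends_on": list(depends_on or []),
        }
        return func  # the function stays directly callable, e.g. in unit tests
    return decorator


@enrichment(target_type="Article", field="word_count")
def count_words(node, ctx):
    """Return the number of whitespace-separated words, or None to skip."""
    if not node.get("body"):
        return None
    return len(node["body"].split())


print(sorted(REGISTRY))
print(count_words({"body": "graphs all the way down"}, ctx=None))
```

Returning the undecorated function keeps enrichment logic testable in isolation, independent of any batch runner.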
LLM-powered enrichment (optional)¶
When heuristics are not sufficient, enrichment functions can use
ctx.llm:
@enrichment(target_type="Article", field="summary")
def summarise(node, ctx):
    """Generate a one-sentence summary using LLM."""
    if not node.body:
        return None
    response = ctx.llm.complete(
        f"Summarise this article in one sentence:\n\n{node.body[:2000]}",
        max_tokens=100,
    )
    return response.text.strip()
LLM calls are cost-tracked through the standard cost envelope system.
Each trails enrich run invocation opens a cost envelope tagged
enrich:<type>:<field>, and the CLI reports total cost at completion.
Derived enrichment (from existing data)¶
Enrichment functions can compute values from other fields on the same node or from other nodes entirely:
@enrichment(target_type="Article", field="quality_score")
def compute_quality(node, ctx):
    """Derive quality score from existing fields — zero cost."""
    score = 0.0
    if node.title and len(node.title) > 10:
        score += 0.3
    if node.body and len(node.body) > 200:
        score += 0.5
    if node.sentiment:  # filled by a prior enrichment
        score += 0.2
    return round(score, 2)
Enrichment dependencies¶
When one enrichment depends on another (e.g., quality_score needs
sentiment to be filled first), declare the dependency:
@enrichment(target_type="Article", field="quality_score",
            depends_on=["sentiment"])
def compute_quality(node, ctx):
    ...
The scheduler resolves the dependency graph and executes enrichment functions in topological order. Fields without dependencies run first; dependent fields run after their prerequisites are filled.
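Topological ordering of this kind is available in Python's standard library. The sketch below uses graphlib.TopologicalSorter (Python 3.9+) on the Article fields from this chapter; enrichment_order is a hypothetical helper, and how the Trails scheduler is actually implemented is not specified here.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def enrichment_order(dependencies):
    """Order fields so that every prerequisite precedes its dependants.

    dependencies maps each field to the list of fields it depends on.
    A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


deps = {
    "sentiment": [],
    "summary": [],
    "quality_score": ["sentiment"],
}
order = enrichment_order(deps)
print(order)
```

Detecting cycles up front is the main benefit of resolving the graph before running anything: a circular depends_on fails immediately instead of leaving fields silently unfilled.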
trails enrich run and trails enrich status — batch enrichment¶
Running enrichments¶
# Run all registered enrichments on unfilled placeholders
trails enrich run
# Target a specific type
trails enrich run --type Article
# Target a specific field
trails enrich run --type Article --field sentiment
# Dry-run: show what would be enriched without executing
trails enrich run --dry-run
# Use the Batches API for LLM-powered enrichments (50% discount)
trails enrich run --batch
trails enrich run iterates over all instances of the target type,
checks which placeholder fields are unfilled (None or absent), and
calls the registered enrichment function for each.
Enrichment is idempotent — running it twice skips already-filled
fields. Use --force to re-enrich previously filled values.
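The skip-if-filled behaviour can be illustrated with a small loop. This is a sketch under stated assumptions: run_enrichment, title_length, and the title_len field are hypothetical, nodes are plain dicts, and the force parameter mirrors the --force flag described above.

```python
def run_enrichment(nodes, field, func, force=False):
    """Fill `field` on each dict node; return the number of values written."""
    written = 0
    for node in nodes:
        if node.get(field) is not None and not force:
            continue  # already filled: skipped unless force is set
        value = func(node)
        if value is None:
            continue  # the enrichment function chose to skip this node
        node[field] = value
        written += 1
    return written


def title_length(node):
    return len(node["title"]) if node.get("title") else None


nodes = [{"title": "Graphs"}, {"title": "Schemas", "title_len": 99}, {"title": None}]
first = run_enrichment(nodes, "title_len", title_length)
second = run_enrichment(nodes, "title_len", title_length)
forced = run_enrichment(nodes, "title_len", title_length, force=True)
print(first, second, forced)
```

The second run writes nothing because every fillable value is already present, while the forced run also overwrites the pre-existing value on the second node.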
Monitoring fill rates¶
trails enrich status reports the fill rate of every field and lists the registered enrichments:
Article (500 instances):
title: 100% filled (500 / 500) — not a placeholder
body: 100% filled (500 / 500) — not a placeholder
sentiment: 23% filled (115 / 500) — enrichment registered
quality_score: 0% filled (0 / 500) — enrichment registered
summary: 89% filled (445 / 500) — enrichment registered
Registered enrichments:
Article.sentiment → analyse_sentiment (heuristic)
Article.quality_score → compute_quality (derived, depends_on: [sentiment])
Article.summary → summarise (llm)
Provenance¶
Every enriched value emits a prov:Activity linking the enrichment
function (as prov:Agent) to the filled value (as prov:Entity). The
activity records the function name, model used (if any), cost, and
timestamp. trails trace can answer "who filled this field and when?"
Agent-driven enrichment¶
For fully autonomous enrichment without pre-registered functions,
trails enrich agent spawns an agent that discovers unfilled
placeholders and attempts to fill them.
How the agent works¶
The agent follows the ReAct (Reason + Act) loop:
- Observe. Query the store for all placeholder fields with fill rates below 100%. Rank fields by estimated difficulty: type coercion (easy) → derivation from existing data (medium) → LLM inference (hard).
- Plan. For each unfilled field, decide a strategy:
  - Derive: compute from other fields on the same node
  - Lookup: query external APIs or reference data
  - Infer: use LLM to generate a plausible value from context
  - Skip: mark as "cannot fill" with a reason
- Act. Execute the strategy, fill the value, record provenance.
- Verify. Check the filled value against the field's type constraint. Invalid values are rolled back with a warning.
CLI usage¶
# Spawn enrichment agent
trails enrich agent
# With budget constraint (stops when budget is exhausted)
trails enrich agent --budget 5.00
# Target specific types
trails enrich agent --type Article --type Author
# Progressive mode: cheap operations first, then expensive ones
trails enrich agent --progressive
The --progressive flag maximises fill rate per dollar: string
manipulation and type coercion execute before LLM calls. The
--budget flag sets a hard cost ceiling — when the envelope is
exhausted, the agent stops, reports what was filled, and lists what
remains.
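The interaction between the two flags amounts to "cheapest first, stop at the ceiling". The sketch below shows that scheduling idea only; run_progressive, the task names, and the cost estimates are all hypothetical, and the real agent's cost accounting goes through cost envelopes rather than a simple running total.

```python
def run_progressive(tasks, budget):
    """Run tasks cheapest first until the budget would be exceeded.

    tasks is a list of (name, estimated_cost_usd) pairs.
    Returns (done, remaining, spent).
    """
    spent = 0.0
    done, remaining = [], []
    for name, cost in sorted(tasks, key=lambda task: task[1]):
        if spent + cost <= budget:
            spent += cost
            done.append(name)
        else:
            remaining.append(name)  # reported back as still unfilled
    return done, remaining, spent


tasks = [
    ("summary (LLM)", 3.0),
    ("pub_date coercion", 0.0),
    ("sentiment (LLM)", 2.0),
    ("quality_score derivation", 0.0),
    ("affiliation lookup", 1.5),
]
done, remaining, spent = run_progressive(tasks, budget=4.0)
print("filled:", done)
print("remaining:", remaining)
print(f"spent: ${spent:.2f}")
```

Because tasks are sorted by cost, the first task that does not fit also rules out every later one, so the loop behaves like a hard stop at the ceiling.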
When to use agent enrichment vs. registered functions¶
| Scenario | Use |
|---|---|
| Known logic, repeated runs | @enrichment + trails enrich run |
| Exploratory, one-off, unknown logic | trails enrich agent |
| Cost-sensitive production | @enrichment (predictable cost) |
| Rapid prototyping | trails enrich agent --budget 2.00 |
Agent-driven enrichment is powerful but non-deterministic. Two runs may produce different values. For production pipelines, registered enrichment functions give you repeatability and testability. The agent is best for exploration, bootstrapping, and filling gaps that don't justify a dedicated function.
Baselines — portable configuration contracts¶
A baseline declares what a Trails project MUST look like to be compliant with a named profile. It is not the configuration itself — it is constraints ON the configuration.
Think of baselines as .eslintrc for your knowledge-graph app:
trails.toml says what the app IS; the baseline says what it SHOULD
BE.
The baseline format¶
Baselines are TOML files with a [baseline] header and typed sections:
[baseline]
name = "healthcare-fhir"
version = "1.0"
extends = "trails:default"
[baseline.store]
backend = "oxigraph"
reasoning = "rdfs"
max_query_time_ms = 30000
[baseline.schema]
upper_ontology = "schema.org"
core_types = ["Patient", "Encounter", "Observation"]
alignment = "fhir-r4"
shacl_strictness = "closed"
[baseline.pipeline]
default_extractors = ["pdf", "html", "docx"]
enrichment_steps = ["ner", "linking", "dedup"]
validation = "strict"
[baseline.policy]
template = "hipaa"
audit_level = "full"
[baseline.agents]
default_model = "haiku"
max_cost_per_run_usd = 1.0
planner = "react"
budget_enforcement = true
[baseline.federation]
require_policy_alignment = true
minimum_peer_trust_level = "verified"
Sections are optional. A baseline that only declares
[baseline.policy] constrains only the policy dimension — everything
else is unconstrained.
Built-in baselines¶
Trails ships four built-in presets:
| Baseline | Description |
|---|---|
| trails:default | Minimal, no constraints, open-world. SHACL open, no policy required, no reasoning. The zero-overhead starting point. |
| trails:strict | SHACL closed-world validation, full PROV-O provenance, Cedar policy required on all capabilities, cost envelopes enforced. |
| trails:research | RDF-star enabled, reasoning active (RDFS), large query timeouts (120s), open SHACL, relaxed cost limits. |
| trails:compliance | Full audit trail, consent receipts required, PROV-O on every write, cost tracking mandatory, budget enforcement on. |
Custom baselines¶
Create your own by writing a TOML file in baselines/ or
~/.trails/baselines/:
[baseline]
name = "my-team-standard"
version = "1.0"
extends = "trails:strict"
[baseline.schema]
core_types = ["Project", "Task", "Person"]
shacl_strictness = "closed"
[baseline.agents]
default_model = "haiku"
max_cost_per_run_usd = 0.50
Baseline inheritance¶
The extends field references a parent baseline:
- extends = "trails:default": built-in preset
- extends = "trails:strict": built-in strict preset
- extends = "file:../base.toml": relative file path
- extends = ["trails:compliance", "file:../domain.toml"]: multiple parents (last wins on conflict, with a warning)
Every baseline without an explicit extends inherits from
trails:default. Child values override parent values at the leaf level.
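Leaf-level override is a recursive dictionary merge. The sketch below shows one way to resolve a chain of parents and a child; merge_baselines and resolve are hypothetical helpers, the parent values are illustrative rather than the real contents of any built-in preset, and the conflict warning mentioned above is omitted for brevity.

```python
import copy


def merge_baselines(parent, child):
    """Return a new dict in which child values override parent values leaf by leaf."""
    merged = copy.deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_baselines(merged[key], value)  # descend into sections
        else:
            merged[key] = copy.deepcopy(value)  # leaf: child wins
    return merged


def resolve(parents, child):
    """Fold parents left to right (last parent wins), then apply the child."""
    resolved = {}
    for parent in parents:
        resolved = merge_baselines(resolved, parent)
    return merge_baselines(resolved, child)


# Illustrative parent values, not the actual definition of a built-in preset.
parent = {
    "schema": {"shacl_strictness": "closed"},
    "agents": {"budget_enforcement": True, "max_cost_per_run_usd": 1.0},
}
team = {
    "schema": {"core_types": ["Project", "Task", "Person"]},
    "agents": {"max_cost_per_run_usd": 0.5},
}
resolved = resolve([parent], team)
print(resolved)
```

Merging at the leaf level means the child can tighten one setting in a section, here the cost ceiling, while inheriting the rest of that section unchanged.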
trails new --baseline and baseline CLI commands¶
Scaffolding a project from a baseline¶
This scaffolds a new project where trails.toml is generated to satisfy
the baseline constraints, and baselines/active.toml records the active
baseline.
Validating against a baseline¶
trails baseline validate checks the current trails.toml against the active baseline. It reports
violations as structured diagnostics (same format as trails doctor):
baseline: schema — FAIL
core_types: missing "Observation" (declared in baseline, not registered)
baseline: policy — PASS
template: "hipaa" loaded
baseline: agents — WARN
max_cost_per_run_usd: configured as 2.0, baseline requires <= 1.0
baseline: store — PASS
backend: oxigraph, reasoning: rdfs
2 passed, 1 warning, 1 failure
trails doctor also runs baseline checks automatically when a baseline
is active.
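The diagnostics above suggest three kinds of check: required list entries, numeric ceilings, and exact values. The sketch below reproduces that report from plain dictionaries. The rules are inferred from the sample output, so check_section and its conventions (for example, treating keys that start with max_ as ceilings) are assumptions rather than the validator's actual logic.

```python
def check_section(baseline, config):
    """Compare one config section against its baseline section.

    Returns (status, messages) where status is "PASS", "WARN", or "FAIL".
    """
    status, messages = "PASS", []
    for key, required in baseline.items():
        actual = config.get(key)
        if isinstance(required, list):
            # List constraints: every required entry must be present.
            missing = [item for item in required if item not in (actual or [])]
            if missing:
                status = "FAIL"
                messages.append(f"{key}: missing {missing}")
        elif key.startswith("max_"):
            # Ceilings: exceeding one is a warning, as in the sample output.
            if actual is None or actual > required:
                status = "WARN" if status == "PASS" else status
                messages.append(
                    f"{key}: configured as {actual}, baseline requires <= {required}"
                )
        elif actual != required:
            status = "FAIL"
            messages.append(f"{key}: expected {required!r}, found {actual!r}")
    return status, messages


baseline = {
    "schema": {"core_types": ["Patient", "Encounter", "Observation"]},
    "agents": {"max_cost_per_run_usd": 1.0},
    "store": {"backend": "oxigraph", "reasoning": "rdfs"},
}
config = {
    "schema": {"core_types": ["Patient", "Encounter"]},
    "agents": {"max_cost_per_run_usd": 2.0},
    "store": {"backend": "oxigraph", "reasoning": "rdfs"},
}
results = {
    name: check_section(rules, config.get(name, {}))
    for name, rules in baseline.items()
}
for name, (status, messages) in results.items():
    print(f"baseline: {name}: {status}")
    for message in messages:
        print(f"  {message}")
```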
Exporting a baseline from an existing project¶
Extracts the current trails.toml configuration as a baseline file.
Useful for capturing a working project's configuration as a reusable
template.
Comparing baselines¶
Shows differences section by section:
[baseline.schema]
shacl_strictness: "open" (base1) → "closed" (base2)
core_types: base1 has ["Article"], base2 has ["Article", "Author"]
[baseline.agents]
max_cost_per_run_usd: 5.0 (base1) → 1.0 (base2)
[baseline.policy]
— only in base2: template = "hipaa"
Federation baseline negotiation (conceptual)¶
When two Trails instances federate, baselines enable automatic compatibility checks. This is a conceptual overview — federation negotiation is designed for post-M12 implementation.
The negotiation flow¶
- Exchange. On federation handshake, each instance sends its active baseline (minus sensitive fields like API keys) to the peer.
- Compatibility check. The framework computes whether the two baselines are compatible: same or compatible SHACL strictness, policy templates that satisfy each other's requirements, compatible reasoning modes (RDFS subsumes no-reasoning; OWL-RL subsumes RDFS).
- Negotiation. If baselines differ but overlap, the framework computes a "negotiated baseline": the intersection of compatible constraints. The stricter option wins for safety: if one peer requires shacl_strictness = "closed" and the other is "open", the negotiated baseline uses "closed".
- Recording. The negotiation result is recorded as a prov:Activity in both instances' provenance graphs.
- Rejection. If baselines are incompatible (e.g., one requires HIPAA policy and the other has no policy engine), federation is refused with a diagnostic explaining the incompatibility.
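Since negotiation is not yet implemented, the following is purely a conceptual sketch of two rules from the flow above: the stricter SHACL setting wins, and a policy requirement the peer cannot satisfy leads to rejection. Everything in it is hypothetical, including the flat dictionary shape, the policy_template and has_policy_engine keys, and the IncompatibleBaselines exception.

```python
STRICTNESS_ORDER = ["open", "closed"]  # later entries are stricter


class IncompatibleBaselines(Exception):
    """Raised when federation must be refused."""


def negotiate(local, peer):
    """Return negotiated constraints for two flat baseline summaries."""
    # Rejection: one side requires a policy template the other cannot enforce.
    for requiring, other in ((local, peer), (peer, local)):
        template = requiring.get("policy_template")
        if template and not other.get("has_policy_engine", False):
            raise IncompatibleBaselines(
                f"policy template {template!r} required, but the peer has no policy engine"
            )
    # Negotiation: the stricter SHACL setting wins.
    stricter = max(
        local["shacl_strictness"],
        peer["shacl_strictness"],
        key=STRICTNESS_ORDER.index,
    )
    return {"shacl_strictness": stricter}


closed_peer = {"shacl_strictness": "closed", "has_policy_engine": True}
open_peer = {"shacl_strictness": "open", "has_policy_engine": True}
negotiated = negotiate(closed_peer, open_peer)
print(negotiated)
```

Checking for rejection before computing the intersection keeps incompatible peers from ever receiving a partially negotiated baseline.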
CLI preview¶
Exchanges baselines with a peer and outputs the negotiated baseline. This command will be implemented when federation reaches Level 3 (ADR-0023).
Full walkthrough: legacy data → new schema → enrich → validate¶
Here is the complete journey, end to end. We will migrate a legacy employee database to a modern schema, enrich it with derived and computed fields, and validate the result against a baseline.
Step 1: Start with legacy data¶
Assume a graph populated from a legacy system with flat, abbreviated field names:
from trails import capability

@capability
def import_legacy(ctx) -> dict:
    """Load legacy employee data."""
    records = [
        {"emp_nm": "Alice Schmidt", "dept_cd": "ENG", "sal_amt": 95000},
        {"emp_nm": "Bob Müller", "dept_cd": "MKT", "sal_amt": 82000},
        {"emp_nm": "Carol Weber", "dept_cd": "ENG", "sal_amt": 91000},
    ]
    for rec in records:
        ctx.kg.node(labels=["Employee"], properties=rec)
    return {"imported": len(records)}
Step 2: Define the target schema¶
The modern schema uses readable field names, splits the full name, and adds fields for enrichment:
from trails.orm import node_type, placeholder

@node_type("Employee", fields={
    "first_name": str,
    "last_name": str,
    "department": str,
    "salary": int,
    "seniority_level": (str, placeholder),  # to be enriched
    "department_full": (str, placeholder),  # to be enriched
})
class Employee: ...
Step 3: Plan the transformation¶
Output:
Transformation plan: legacy → v2
[split] emp_nm → first_name + last_name (structural)
[rename] dept_cd → department (fuzzy match, 0.87)
[rename] sal_amt → salary (fuzzy match, 0.83)
[placeholder] seniority_level — no source equivalent
[placeholder] department_full — no source equivalent
3 mappable, 2 placeholders
Review the plan. Adjust if needed. Then execute:
Step 4: Register enrichment functions¶
from trails.enrichment import enrichment

@enrichment(target_type="Employee", field="department_full")
def expand_department(node, ctx):
    """Expand department codes to full names — pure lookup, zero cost."""
    codes = {"ENG": "Engineering", "MKT": "Marketing", "FIN": "Finance"}
    return codes.get(node.department, node.department)

@enrichment(target_type="Employee", field="seniority_level",
            depends_on=["department_full"])
def estimate_seniority(node, ctx):
    """Derive seniority from salary bands — no LLM needed."""
    if node.salary >= 100000:
        return "senior"
    elif node.salary >= 85000:
        return "mid"
    return "junior"
Both functions are deterministic and free. No LLM calls, no API costs.
Step 5: Run enrichment¶
Output:
Enriching Employee (3 instances):
department_full: 3 / 3 filled (expand_department)
seniority_level: 3 / 3 filled (estimate_seniority)
Total cost: $0.00
Duration: 0.1s
Check the result:
Employee (3 instances):
first_name: 100% filled (3 / 3)
last_name: 100% filled (3 / 3)
department: 100% filled (3 / 3)
salary: 100% filled (3 / 3)
seniority_level: 100% filled (3 / 3) — enriched
department_full: 100% filled (3 / 3) — enriched
Step 6: Define and validate a baseline¶
Create baselines/active.toml:
[baseline]
name = "employee-app"
version = "1.0"
extends = "trails:strict"
[baseline.schema]
core_types = ["Employee"]
shacl_strictness = "closed"
[baseline.agents]
default_model = "haiku"
max_cost_per_run_usd = 1.0
budget_enforcement = true
Validate:
baseline: schema — PASS
core_types: "Employee" registered
shacl_strictness: closed
baseline: agents — PASS
budget_enforcement: on
max_cost_per_run_usd: 1.0
2 passed, 0 warnings, 0 failures
Step 7: Verify provenance¶
Every step left a trail. Query it:
from trails import capability

@capability
def transformation_history(ctx) -> list:
    """Show what happened to our data."""
    results = ctx.kg.sparql("""
        SELECT ?entity ?activity ?kind ?timestamp
        WHERE {
            ?entity prov:wasGeneratedBy ?activity .
            ?activity trails:activityKind ?kind ;
                      prov:endedAtTime ?timestamp .
        }
        ORDER BY ?timestamp
        LIMIT 20
    """)
    return [dict(r) for r in results]
The provenance graph records the full chain: ingestion → transformation → enrichment, with timestamps, function names, and costs at every step.
The progressive story¶
Schema transformation, enrichment, and baselines follow the same progressive-enhancement principle as the rest of Trails:
| Level | What you use | What you get |
|---|---|---|
| 0 | Manual SPARQL CONSTRUCT | Hand-written transformation queries |
| 1 | trails onto transform | Auto-generated field mappings, dry-run |
| 2 | Placeholder fields + @enrichment | Typed gaps + registered fillers |
| 3 | trails enrich run | Batch enrichment with provenance |
| 4 | trails enrich agent | Autonomous gap-filling with budget control |
| — | trails baseline validate | Configuration compliance checking |
Each level is additive. A project at Level 1 never sees enrichment. A project at Level 3 may never need Level 4. Manual SPARQL CONSTRUCT continues to work at every level.
Baselines are orthogonal — they validate configuration at any level,
from a bare trails.toml to a fully enriched, federated deployment.
See also¶
- Auto-Ontology — schema inference and generation, the precursor to transformation
- Data Integration — RML mappings for ingesting external data before transformation
- Trust and Policy — Cedar policies and provenance, which baselines can constrain
- Federation and Scale — multi-instance federation, where baseline negotiation applies
- ADR-0026 — the transformation and enrichment architecture decision
- ADR-0027 — the baselines architecture decision