ADR-0026: Schema Transformation and Agent-Driven Enrichment
- Status: Accepted (2026-04-19)
- Date: 2026-04-17
- Depends on: ADR-0012 (Cost as framework primitive), ADR-0017 (ActiveGraph ORM), ADR-0021 (Progressive enhancement), ADR-0024 (RML declarative mapping), ADR-0025 (Auto-ontology generation)
- Supersedes: —
- Superseded by: —
Context
Real-world knowledge-graph projects constantly migrate between schemas: legacy systems to standards, v1 to v2, domain A to domain B. Every schema change triggers a cascade of manual work — writing SPARQL CONSTRUCT queries, mapping field names, coercing types, handling structural mismatches (flat to nested, merged to split). Traditional ETL is code-heavy and brittle: every field mapping is hand-coded, every structural change requires a new transformation function, and the maintenance burden scales linearly with schema count.
Three forces converge to make this problem urgent for Trails:
- Scientists and engineers introduce new fields during transformation that will be filled later. A clinical researcher migrating from a study-specific schema to a FHIR-aligned one adds a `sentiment` field that does not exist in the source. A data engineer adds `quality_score` to track enrichment quality. These placeholder fields are legitimate (they mark intent to enrich), but current tools either reject them (strict validation) or silently ignore them (untyped property soup).
- Agents are natural enrichment workers. An agent can observe empty fields, decide which ones it can fill, and fill them using LLM inference, external API lookups, or SPARQL derivation from existing data. The ReAct planner (M9) already supports this loop. The cost envelope system (ADR-0012) already tracks the budget. What's missing is the machinery to connect unfilled placeholders to enrichment functions.
- Trails already has the building blocks. RML (ADR-0024) maps external sources to the KG. `onto_infer` (M14) discovers schemas from data. `onto_generate` (M14) creates schemas from descriptions. `onto_evolution` (M4) diffs and migrates schemas. The agent runtime (M9) provides planners and LLM clients. What's missing is the transformation engine that maps one KG schema to another and the enrichment pipeline that fills the gaps.
Decision
Four mechanisms, layered progressively. Each builds on the previous; none is required.
1. trails onto transform: schema-to-schema transformation engine
A new CLI command and Python API that generates and executes transformations between two KG schemas.
Inputs:
- Source schema: a TTL file (SHACL shapes) or Python module with `@node_type` declarations.
- Target schema: same formats.
Process:
- Auto-mapping. Fields with matching names and compatible types are mapped automatically. `title: str` in source → `title: str` in target requires zero configuration.
- LLM-assisted mapping. Fields with ambiguous correspondence (`author_name` → `creator.full_name`, `pub_date` → `published_at`) are proposed by a cheap LLM call. The user confirms or overrides. Cost-tracked per ADR-0012.
- Transformation code generation. The engine produces one of:
  - SPARQL CONSTRUCT queries (preferred for simple renames, type coercion, 1:1 structural changes).
  - Python transformation functions using the ORM (ADR-0017) for complex cases: splitting/merging fields, conditional logic, multi-step derivations.
- Execution. `trails onto transform --execute` runs the generated transformation against the store. Without `--execute`, the command produces the transformation plan (dry-run by default).
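The auto-mapping step can be pictured with a minimal sketch. The `COMPATIBLE` table, the `FieldMapping` class, and the `auto_map` function are illustrative assumptions, not part of the Trails API; the sketch only shows how name-and-type matching might separate fields that need no configuration from those left for LLM-assisted mapping or placeholders.

```python
from dataclasses import dataclass

# Hypothetical compatibility table: source type -> target types it may map to
# without an explicit coercion step. Not taken from Trails.
COMPATIBLE = {
    str: {str},
    int: {int, float},
    float: {float},
    bool: {bool},
}


@dataclass(frozen=True)
class FieldMapping:
    source: str
    target: str
    strategy: str


def auto_map(source_fields: dict, target_fields: dict):
    """Split target fields into auto-mapped ones and ones needing attention."""
    mappings, unresolved = [], []
    for name, target_type in target_fields.items():
        source_type = source_fields.get(name)
        if source_type is not None and target_type in COMPATIBLE.get(source_type, set()):
            mappings.append(FieldMapping(name, name, "auto"))
        else:
            # Candidate for LLM-assisted mapping or a placeholder.
            unresolved.append(name)
    return mappings, unresolved


v1 = {"title": str, "author_name": str, "pages": int}
v2 = {"title": str, "creator_name": str, "pages": float, "sentiment": str}
mappings, unresolved = auto_map(v1, v2)
print([m.target for m in mappings])  # ['title', 'pages']
print(unresolved)                    # ['creator_name', 'sentiment']
```

In this sketch `creator_name` would go to the LLM-assisted step, while `sentiment` has no plausible source and would end up as a placeholder.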
Supported transformations:
| Category | Example | Strategy |
|---|---|---|
| Field rename | `author_name` → `creator_name` | SPARQL CONSTRUCT with predicate substitution |
| Type coercion | `"42"` (string) → `42` (integer) | SPARQL BIND with cast |
| Field split | `full_name` → `first_name` + `last_name` | Python function (regex or LLM) |
| Field merge | `first_name` + `last_name` → `full_name` | SPARQL CONCAT |
| Structural change | flat author string → nested Author node | SPARQL CONSTRUCT + INSERT for new nodes |
| Enumeration mapping | `"M"`/`"F"` → `"male"`/`"female"` | SPARQL VALUES lookup table |
| Unmappable fields | target field has no source equivalent | Marked as placeholder (see §2) |
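As an illustration of the field-rename row, the following sketch builds a SPARQL CONSTRUCT query that copies values from an old predicate to a new one. The function name `rename_construct` and the `http://example.org/` IRIs are hypothetical; the queries Trails actually generates may look different.

```python
def rename_construct(class_iri: str, renames: dict) -> str:
    """Build a SPARQL CONSTRUCT that copies values under renamed predicates."""
    template, pattern = [], []
    for i, (old, new) in enumerate(renames.items()):
        template.append(f"  ?s <{new}> ?v{i} .")
        # OPTIONAL keeps nodes that lack one of the source predicates;
        # CONSTRUCT drops template triples whose variables are unbound.
        pattern.append(f"  OPTIONAL {{ ?s <{old}> ?v{i} . }}")
    lines = [
        "CONSTRUCT {",
        *template,
        "}",
        "WHERE {",
        f"  ?s a <{class_iri}> .",
        *pattern,
        "}",
    ]
    return "\n".join(lines)


query = rename_construct(
    "http://example.org/Article",  # hypothetical class IRI
    {"http://example.org/author_name": "http://example.org/creator_name"},
)
print(query)
```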
CLI surface:
```bash
# Generate transformation plan (dry-run, default)
trails onto transform --source models/v1.py --target models/v2.py

# Generate and show the SPARQL / Python code
trails onto transform --source old.ttl --target new.ttl --show-code

# Execute the transformation
trails onto transform --source models/v1.py --target models/v2.py --execute

# Use LLM for ambiguous field mappings
trails onto transform --source old.ttl --target new.ttl --llm-assist

# Mark unmappable target fields as placeholders
trails onto transform --source old.ttl --target new.ttl --enrich
```
Python API:
```python
from trails.onto_transform import TransformPlan, generate_plan, execute_plan

# ctx is an existing Trails context; its construction is not shown here.
plan: TransformPlan = generate_plan(
    source_schema="models/v1.py",
    target_schema="models/v2.py",
    llm_assist=True,
    ctx=ctx,
)

# Inspect before executing
for mapping in plan.field_mappings:
    print(f"{mapping.source} → {mapping.target} ({mapping.strategy})")
for placeholder in plan.placeholders:
    print(f"  [placeholder] {placeholder.field}: no source equivalent")

# Execute
result = execute_plan(plan, ctx)
print(f"Transformed {result.node_count} nodes, {result.triple_count} triples")
```
2. Placeholder fields and enrichment markers
`@node_type` gains a `placeholder` marker on field definitions to
declare "this field exists in the schema but is not yet populated."
Syntax:
```python
from trails.orm import node_type, placeholder


@node_type("Article", fields={
    "title": str,
    "body": str,
    "sentiment": (str, placeholder),        # not yet filled
    "quality_score": (float, placeholder),  # not yet filled
    "summary": (str, placeholder),          # not yet filled
})
class Article:
    """An article with enrichment placeholders."""
    pass
```
Semantics:
- Placeholder fields are nullable by default: writes that omit them succeed without validation errors.
- SHACL export emits `sh:minCount 0` for placeholder fields regardless of the type's normal cardinality rules. A `str` placeholder does not generate `sh:minCount 1`.
- The `placeholder` marker is metadata, not a type. The field's actual type (`str`, `float`, etc.) is preserved for validation when a value is written.
- `trails doctor` reports placeholder fill rates.
Integration with trails onto transform:
When --enrich is passed, target fields that have no source equivalent
are automatically marked as placeholders in the generated code. This
bridges transformation and enrichment: transform what you can, mark what
you can't, enrich later.
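The placeholder semantics in this section can be sketched in a few lines. Everything below is an assumption made for illustration: the `_Placeholder` sentinel stands in for the marker imported from `trails.orm`, plain fields are assumed to be required (`sh:minCount 1`), and `fill_rate` is a hypothetical stand-in for whatever `trails doctor` computes.

```python
class _Placeholder:
    """Sentinel marking a field as declared but not yet populated."""

    def __repr__(self):
        return "placeholder"


placeholder = _Placeholder()  # stand-in for the marker from trails.orm


def parse_field(spec):
    """Return (python_type, is_placeholder) for `str` or `(str, placeholder)`."""
    if isinstance(spec, tuple):
        field_type, *markers = spec
        return field_type, placeholder in markers
    return spec, False


def min_count(spec) -> int:
    """Placeholders are always optional; plain fields are assumed required."""
    _, is_placeholder = parse_field(spec)
    return 0 if is_placeholder else 1


def fill_rate(nodes: list, field: str) -> float:
    """Fraction of nodes whose field is present and not None."""
    if not nodes:
        return 0.0
    filled = sum(1 for node in nodes if node.get(field) is not None)
    return filled / len(nodes)


fields = {"title": str, "sentiment": (str, placeholder)}
articles = [
    {"title": "A", "sentiment": "positive"},
    {"title": "B"},
    {"title": "C", "sentiment": None},
    {"title": "D", "sentiment": "neutral"},
]
print({name: min_count(spec) for name, spec in fields.items()})
# {'title': 1, 'sentiment': 0}
print(fill_rate(articles, "sentiment"))  # 0.5
```

The field type stays available from `parse_field`, so a value written later can still be validated against `str` or `float`.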
3. Enrichment pipelines
A decorator-based registration system for functions that fill placeholder fields.
Syntax:
```python
from trails.enrichment import enrichment


@enrichment(target_type="Article", field="sentiment")
def analyse_sentiment(node, ctx):
    """Fill the sentiment field using LLM analysis."""
    response = ctx.llm.complete(
        f"Classify the sentiment of this text as positive, negative, "
        f"or neutral: {node.body[:500]}",
        max_tokens=10,
    )
    return response.text.strip().lower()


# Reads node.sentiment, so it declares the dependency explicitly.
@enrichment(target_type="Article", field="quality_score", depends_on=["sentiment"])
def compute_quality(node, ctx):
    """Derive quality score from existing fields."""
    score = 0.0
    if node.title and len(node.title) > 10:
        score += 0.3
    if node.body and len(node.body) > 200:
        score += 0.5
    if node.sentiment:
        score += 0.2
    return round(score, 2)


@enrichment(target_type="Article", field="summary")
def summarise(node, ctx):
    """Generate a summary using LLM."""
    response = ctx.llm.complete(
        f"Summarise this article in one sentence:\n\n{node.body[:2000]}",
        max_tokens=100,
    )
    return response.text.strip()
```
Execution:
```bash
# Run all registered enrichments on unfilled placeholders
trails enrich run

# Run enrichments for a specific type
trails enrich run --type Article

# Run a specific field enrichment
trails enrich run --type Article --field sentiment

# Show fill rates per field
trails enrich status

# Dry-run: show what would be enriched without executing
trails enrich run --dry-run
```
Semantics:
- `trails enrich run` iterates over all instances of the target type, checks which placeholder fields are unfilled (`None` or absent), and calls the registered enrichment function for each.
- Enrichment functions receive the node (ORM instance per ADR-0017) and a context with `ctx.llm`, `ctx.kg`, and `ctx.cost`.
- Cost-tracked. Each `trails enrich run` invocation opens a `CostTracker` envelope tagged `enrich:<type>:<field>` per ADR-0012. The CLI reports total cost at completion.
- Batch-aware. Enrichment functions that call `ctx.llm` benefit from the Batches API (50% discount) when `--batch` is passed. Functions are grouped by field, and prompts are batched.
- PROV-O provenance. Each enriched value emits a `prov:Activity` linking the enrichment function (as `prov:Agent`) to the filled value (as `prov:Entity`). The activity records the model used, cost, and timestamp. `trails trace` can answer "who filled this field and when?"
- Idempotent. Running `trails enrich run` twice skips already-filled fields. Re-enrichment requires `--force`.
- Order-aware. Enrichment functions can declare dependencies: `@enrichment(..., depends_on=["sentiment"])` ensures `sentiment` is filled before `quality_score` runs. The scheduler resolves the dependency graph and executes in topological order.
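The order-aware and idempotent behaviour can be sketched with the standard library's `graphlib`. The function names, the dict-based node, and the lambda fillers are hypothetical stand-ins; the actual scheduler may be implemented differently.

```python
from graphlib import TopologicalSorter


def enrichment_order(depends_on: dict) -> list:
    """Return fields ordered so that every dependency precedes its dependents."""
    # TopologicalSorter takes {node: predecessors} and raises CycleError on cycles.
    return list(TopologicalSorter(depends_on).static_order())


def run_enrichments(node: dict, fillers: dict, depends_on: dict, force: bool = False) -> list:
    """Fill unfilled fields in dependency order; return the fields written."""
    written = []
    for field in enrichment_order(depends_on):
        if field not in fillers:
            continue
        if node.get(field) is not None and not force:
            continue  # idempotent: skip already-filled fields
        node[field] = fillers[field](node)
        written.append(field)
    return written


deps = {"sentiment": [], "summary": [], "quality_score": ["sentiment"]}
fillers = {
    "sentiment": lambda n: "neutral",
    "summary": lambda n: n["body"][:20],
    "quality_score": lambda n: 0.2 if n.get("sentiment") else 0.0,
}
article = {"body": "Knowledge graphs need care.", "summary": "Already summarised."}

first = run_enrichments(article, fillers, deps)
second = run_enrichments(article, fillers, deps)
print(first)   # ['sentiment', 'quality_score']
print(second)  # []
```

The second run writes nothing because every field is already filled; passing `force=True` in this sketch plays the role of `--force`.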
4. Agent-driven enrichment (advanced)
For fully autonomous enrichment, trails enrich agent spawns an agent
that discovers unfilled placeholders and attempts to fill them without
pre-registered enrichment functions.
Process (ReAct planner from M9):
- Observe. The agent queries the store for all placeholder fields with fill rates below 100%. It ranks fields by estimated difficulty: type coercion (easy) → lookup from existing data (medium) → LLM inference (hard).
- Plan. For each unfilled field, the agent decides a strategy:
  - Derive: compute the value from other fields on the same node (e.g., `full_name` from `first_name` + `last_name`).
  - Lookup: query external APIs or reference data.
  - Infer: use LLM to generate a plausible value from context.
  - Skip: mark as "cannot fill" with a reason.
- Act. The agent executes the strategy, fills the value, and records provenance.
- Verify. The agent checks the filled value against the field's type constraint. Invalid values are rolled back with a warning.
CLI surface:
```bash
# Spawn enrichment agent
trails enrich agent

# With budget constraint (stops when exhausted)
trails enrich agent --budget 5.00

# Target specific types
trails enrich agent --type Article --type Author

# Progressive mode: easy fields first, then hard
trails enrich agent --progressive
```
Budget enforcement: the agent operates within a cost envelope per
ADR-0012. When the budget is exhausted, it stops and reports what was
filled and what remains. The `--progressive` flag ensures cheap
operations (type coercion, string manipulation) execute before expensive
ones (LLM calls), maximising fill rate per dollar.
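A small sketch shows why progressive ordering tends to raise fill rate under a fixed budget. The `Task` class, the `DIFFICULTY` ranking, and the cost estimates are all hypothetical; the real agent works through the ReAct loop and the ADR-0012 cost envelope, not a static list.

```python
from dataclasses import dataclass

# Hypothetical difficulty ranking: lower values run first in progressive mode.
DIFFICULTY = {"coerce": 0, "derive": 1, "infer": 2}


@dataclass
class Task:
    field: str
    strategy: str
    est_cost: float  # estimated cost in dollars (illustrative numbers)


def plan_within_budget(tasks: list, budget: float, progressive: bool = True):
    """Run tasks until the next one would exceed the budget, then stop."""
    queue = list(tasks)
    if progressive:
        queue.sort(key=lambda t: (DIFFICULTY[t.strategy], t.est_cost))
    done, spent = [], 0.0
    for i, task in enumerate(queue):
        if spent + task.est_cost > budget:
            # Budget exhausted: report what was filled and what remains.
            return done, [t.field for t in queue[i:]], spent
        spent += task.est_cost
        done.append(task.field)
    return done, [], spent


tasks = [
    Task("summary", "infer", 0.75),
    Task("full_name", "derive", 0.0),
    Task("sentiment", "infer", 0.5),
    Task("age", "coerce", 0.0),
]
print(plan_within_budget(tasks, budget=1.0, progressive=True))
# (['age', 'full_name', 'sentiment'], ['summary'], 0.5)
print(plan_within_budget(tasks, budget=1.0, progressive=False))
# (['summary', 'full_name'], ['sentiment', 'age'], 0.75)
```

With the same budget, the progressive run fills three fields and the unordered run fills two.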
Provenance: every agent-filled value links to a prov:Activity
with trails:activityKind "trails.enrich.agent" and the agent's
session ID. The full ReAct trace (observations, plans, actions) is
recorded in the session log per M9's session persistence.
Progressive enhancement integration
The four mechanisms map onto ADR-0021's additive model:
| Level | Mechanism | What it adds |
|---|---|---|
| 0 | Manual SPARQL CONSTRUCT | Status quo: hand-written transformation queries |
| 1 | `trails onto transform` | Auto-generated field mappings with dry-run |
| 2 | Placeholder fields + `@enrichment` | Typed gaps in the schema + registered fillers |
| 3 | `trails enrich run` | Batch execution of enrichment pipelines |
| 4 | `trails enrich agent` | Autonomous agent fills gaps using ReAct planning |
Each level is additive. A project that uses Level 1 never sees enrichment. A project that uses Level 3 may never need Level 4. Users who write manual SPARQL CONSTRUCT today keep doing so — nothing changes for them.
Non-goals
- Full ETL framework. Trails is not Airflow. No scheduling, no streaming, no DAG orchestration. `trails enrich run` is a one-shot batch command; recurring execution is the caller's responsibility (cron, CI pipeline, external scheduler).
- Schema versioning system. Schema versions are tracked by git + `trails onto evolve` (M4). This ADR does not add a version registry, migration history table, or rollback mechanism beyond what git provides.
- Replacing RML. RML (ADR-0024) maps external sources into the KG. `trails onto transform` maps one KG schema to another KG schema. Different problems, complementary solutions. A typical pipeline is: RML to ingest, transform to reshape, enrich to fill gaps.
- Real-time enrichment. Enrichment is batch, not streaming. A write-time enrichment hook (enrich on insert) is out of scope; it introduces latency, error-handling complexity, and cost unpredictability that batch avoids.
- Unsupervised schema changes. `trails onto transform --execute` modifies the store, but the transformation plan is always generated first and shown to the user. No silent rewrites.
Consequences
Positive
- Schema migration becomes declarative: source + target in, plan out. The manual SPARQL CONSTRUCT grind is replaced by auto-generated code that the user reviews and executes.
- Placeholder fields make "not yet populated" a first-class concept. No more `null` ambiguity: a placeholder explicitly declares intent to enrich.
- Enrichment pipelines connect the schema layer to the agent runtime. The same agent that queries the KG can fill gaps in it, with cost tracking and provenance.
- Progressive enhancement works: a project starts with manual SPARQL, adds `trails onto transform` when schemas multiply, adds enrichment when agents are ready. No mode switches.
- PROV-O coverage extends to enrichment: every filled value has a provenance chain. `trails trace` answers "where did this value come from?" whether the value was ingested, transformed, or enriched.
Negative
- New CLI surface: `trails onto transform`, `trails enrich run/status/agent`. More commands to document and maintain.
- LLM-assisted mapping and agent-driven enrichment are non-deterministic. Two runs may produce different mappings or different filled values. Mitigation: dry-run by default, provenance on every write, deterministic strategies preferred over LLM when possible.
- Placeholder fields add a concept to the ORM surface. Users must learn `(type, placeholder)` syntax. Mitigation: the syntax is optional; users who never use placeholders never see them.
- Enrichment ordering (dependency graph) adds complexity. Mitigation: order is resolved automatically; users only declare `depends_on` when needed.
Neutral
- Storage model unchanged. Transformed and enriched triples live in the same Oxigraph store as everything else. No new named-graph conventions.
- The ORM surface (`@node_type`) is extended, not replaced. Existing `@node_type` declarations without placeholders work identically.
- Cost model unchanged. Enrichment costs flow through the existing `CostTracker` (ADR-0012). No new billing surface.
Relationship to other ADRs
| ADR | Relationship |
|---|---|
| ADR-0012 (Cost as framework primitive) | Extended: enrichment runs open cost envelopes; agent enrichment respects budget constraints. |
| ADR-0017 (ActiveGraph ORM) | Extended: @node_type gains placeholder marker; enrichment functions receive ORM instances. |
| ADR-0021 (Progressive enhancement) | Aligned: transformation and enrichment are additive layers. No mode switches. |
| ADR-0024 (RML declarative mapping) | Complementary: RML maps external sources → KG; transform maps KG schema → KG schema. Pipeline: ingest via RML, reshape via transform, fill via enrich. |
| ADR-0025 (Auto-ontology generation) | Complementary: onto infer discovers source schema, onto generate designs target schema, onto transform migrates between them. |
| ADR-0009 (Provenance always on) | Extended: enrichment activities use PROV-O. New activityKind values: trails.onto_transform.execute, trails.enrich.run, trails.enrich.agent. |
| ADR-0018 (Agent runtime) | Extended: trails enrich agent uses the ReAct planner and session persistence from M9. |
Alternatives considered
- Extend RML to handle KG-to-KG transformation. Rejected. RML's data model assumes external sources (CSV, JSON, XML, SQL) with iterators and reference formulations. KG-to-KG transformation is a different problem: the data is already triples, the mapping is schema-to-schema, and the operations (rename, coerce, split, merge) don't map to RML's source/iterator/subject-map model. Forcing RML into this role would require non-standard extensions that break compatibility with Morph-KGC.
- SPARQL CONSTRUCT as the only transformation mechanism. Rejected. SPARQL CONSTRUCT is powerful but requires SPARQL fluency. Auto-generating CONSTRUCT queries from schema diffs gives users the same power without the expertise requirement. The generated CONSTRUCT is still visible (`--show-code`) and editable, so no abstraction is lost.
- External ETL tool integration (dbt, Airbyte, etc.). Rejected for the transformation layer. External tools solve extract-load, not schema-to-schema transformation within a KG. Enrichment could theoretically delegate to external tools, but the cost-tracking and provenance requirements make in-framework enrichment more traceable.
- Write-time enrichment hooks (enrich on insert). Rejected for v1. Real-time enrichment adds latency to every write, makes error handling complex (what if the LLM is down?), and makes cost unpredictable. Batch enrichment via `trails enrich run` is simpler, cheaper (Batches API eligibility), and easier to debug. Write-time hooks may be revisited in a future ADR when real-time requirements materialise.
- Schema diffing only (no enrichment). Rejected. Schema transformation without enrichment leaves unmappable fields as dead ends. The placeholder + enrichment pipeline turns dead ends into actionable work items that agents can execute. The two features are designed together because they solve the same workflow: reshape data, then fill the gaps.
Open questions
- Conflict resolution for overlapping enrichments. If two enrichment functions target the same field, which wins? Current design: first-registered wins; `--force` allows re-enrichment. Should there be a priority system or a "best of N" evaluator? Recommendation: first-registered for v1; priority system as a follow-on if real use cases demand it.
- Incremental transformation. Should `trails onto transform` support incremental mode (transform only nodes added since last run)? For large stores, full re-transformation is expensive. Recommendation: full transformation for v1; incremental as an optimisation when stores exceed ~100k nodes, using the provenance activity timestamps as the high-water mark.
- Enrichment function testing. How should users test enrichment functions in isolation? The `ctx.llm` dependency makes unit testing non-trivial. Recommendation: `LLMClient.mock()` (from M9) for unit tests; `trails enrich run --dry-run` for integration tests. Document the testing pattern in the enrichment guide.
- Cross-type enrichment. Should an enrichment function be able to read from type A to fill a field on type B (e.g., fill `Article.author_bio` by looking up `Author.biography`)? Recommendation: yes. The enrichment function receives `ctx.kg` and can query any type. No restriction on read scope; write scope is limited to the declared target type and field.
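On the enrichment function testing question, one possible shape for a unit test is sketched below. It uses a hand-rolled `StubLLM` class, not `LLMClient.mock()`, whose interface this ADR does not specify; the stub only assumes that `ctx.llm.complete(...)` returns an object with a `.text` attribute, as the enrichment examples in §3 do.

```python
from types import SimpleNamespace


class StubLLM:
    """Hypothetical stand-in for ctx.llm that returns a canned reply."""

    def __init__(self, reply: str):
        self.reply = reply
        self.prompts = []

    def complete(self, prompt, max_tokens=None):
        self.prompts.append(prompt)
        return SimpleNamespace(text=self.reply)


def analyse_sentiment(node, ctx):
    """Same body as the §3 example, without the registration decorator."""
    response = ctx.llm.complete(
        f"Classify the sentiment of this text as positive, negative, "
        f"or neutral: {node.body[:500]}",
        max_tokens=10,
    )
    return response.text.strip().lower()


def test_analyse_sentiment_normalises_reply():
    llm = StubLLM("  Positive\n")
    ctx = SimpleNamespace(llm=llm)
    node = SimpleNamespace(body="Great results. " * 100)

    assert analyse_sentiment(node, ctx) == "positive"
    # Exactly one LLM call, with the body truncated to 500 characters.
    assert len(llm.prompts) == 1
    assert llm.prompts[0].endswith(node.body[:500])


test_analyse_sentiment_normalises_reply()
```

Because the stub records prompts, a test like this can check both the normalised return value and what was sent to the model, without any network call or cost.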