ADR-0026: Schema Transformation and Agent-Driven Enrichment

  • Status: Accepted (2026-04-19)
  • Date: 2026-04-17
  • Depends on: ADR-0012 (Cost as framework primitive), ADR-0017 (ActiveGraph ORM), ADR-0021 (Progressive enhancement), ADR-0024 (RML declarative mapping), ADR-0025 (Auto-ontology generation)
  • Supersedes:
  • Superseded by:

Context

Real-world knowledge-graph projects constantly migrate between schemas: legacy systems to standards, v1 to v2, domain A to domain B. Every schema change triggers a cascade of manual work — writing SPARQL CONSTRUCT queries, mapping field names, coercing types, handling structural mismatches (flat to nested, merged to split). Traditional ETL is code-heavy and brittle: every field mapping is hand-coded, every structural change requires a new transformation function, and the maintenance burden scales linearly with schema count.

Three forces converge to make this problem urgent for Trails:

  1. Scientists and engineers introduce new fields during transformation that will be filled later. A clinical researcher migrating from a study-specific schema to a FHIR-aligned one adds a sentiment field that does not exist in the source. A data engineer adds quality_score to track enrichment quality. These placeholder fields are legitimate — they mark intent to enrich — but current tools either reject them (strict validation) or silently ignore them (untyped property soup).

  2. Agents are natural enrichment workers. An agent can observe empty fields, decide which ones it can fill, and fill them using LLM inference, external API lookups, or SPARQL derivation from existing data. The ReAct planner (M9) already supports this loop. The cost envelope system (ADR-0012) already tracks the budget. What's missing is the machinery to connect unfilled placeholders to enrichment functions.

  3. Trails already has the building blocks. RML (ADR-0024) maps external sources to the KG. onto_infer (M14) discovers schemas from data. onto_generate (M14) creates schemas from descriptions. onto_evolution (M4) diffs and migrates schemas. The agent runtime (M9) provides planners and LLM clients. What's missing is the transformation engine that maps one KG schema to another and the enrichment pipeline that fills the gaps.

Decision

Four mechanisms, layered progressively. Each builds on the previous; none is required.

1. trails onto transform — schema-to-schema transformation engine

A new CLI command and Python API that generates and executes transformations between two KG schemas.

Inputs:

  • Source schema: a TTL file (SHACL shapes) or Python module with @node_type declarations.
  • Target schema: same formats.

Process:

  1. Auto-mapping. Fields with matching names and compatible types are mapped automatically. title: str in source → title: str in target requires zero configuration.

  2. LLM-assisted mapping. Fields with ambiguous correspondence (author_name → creator.full_name, pub_date → published_at) are proposed by a cheap LLM call. The user confirms or overrides. Cost-tracked per ADR-0012.

  3. Transformation code generation. The engine produces one of:

     • SPARQL CONSTRUCT queries (preferred for simple renames, type coercion, 1:1 structural changes).
     • Python transformation functions using the ORM (ADR-0017) for complex cases: splitting/merging fields, conditional logic, multi-step derivations.

  4. Execution. trails onto transform --execute runs the generated transformation against the store. Without --execute, the command produces the transformation plan (dry-run by default).
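The auto-mapping step (step 1 above) can be sketched as a name-and-type match over two field dictionaries. This is a minimal illustration, not the engine's implementation: the COMPATIBLE table, the FieldMapping class, and the auto_map function are hypothetical names introduced here, and the real compatibility rules may differ.

```python
from dataclasses import dataclass

# Hypothetical compatibility table: source type -> target types it can map to
# without an explicit coercion step. Not part of the Trails API.
COMPATIBLE = {
    str: {str},
    int: {int, float},
    float: {float},
    bool: {bool},
}


@dataclass(frozen=True)
class FieldMapping:
    source: str
    target: str
    strategy: str


def auto_map(source_fields, target_fields):
    """Split target fields into auto-mapped fields and leftovers."""
    mappings, unmapped = [], []
    for name, target_type in target_fields.items():
        source_type = source_fields.get(name)
        if source_type is not None and target_type in COMPATIBLE.get(source_type, set()):
            mappings.append(FieldMapping(name, name, "auto"))
        else:
            # Candidates for LLM-assisted mapping or placeholder marking.
            unmapped.append(name)
    return mappings, unmapped


v1 = {"title": str, "author_name": str, "views": int}
v2 = {"title": str, "creator_name": str, "views": float, "sentiment": str}
mappings, unmapped = auto_map(v1, v2)
print([m.target for m in mappings])
print(unmapped)
```

Fields left in unmapped are the ones that would go on to LLM-assisted mapping or, with --enrich, become placeholders.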

Supported transformations:

| Category | Example | Strategy |
|---|---|---|
| Field rename | author_name → creator_name | SPARQL CONSTRUCT with predicate substitution |
| Type coercion | "42" (string) → 42 (integer) | SPARQL BIND with cast |
| Field split | full_name → first_name + last_name | Python function (regex or LLM) |
| Field merge | first_name + last_name → full_name | SPARQL CONCAT |
| Structural change | flat author string → nested Author node | SPARQL CONSTRUCT + INSERT for new nodes |
| Enumeration mapping | "M"/"F" → "male"/"female" | SPARQL VALUES lookup table |
| Unmappable fields | target field has no source equivalent | Marked as placeholder (see §2) |
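To make the field-rename strategy concrete, here is a rough sketch of the kind of CONSTRUCT query a generator might emit for a predicate substitution. The rename_construct helper and the example.org IRIs are assumptions made for illustration; the code the engine actually generates may look different.

```python
def rename_construct(type_iri: str, old_pred: str, new_pred: str) -> str:
    """Build a CONSTRUCT query that re-emits old_pred values under new_pred."""
    return (
        "CONSTRUCT {\n"
        f"  ?node <{new_pred}> ?value .\n"
        "}\n"
        "WHERE {\n"
        f"  ?node a <{type_iri}> ;\n"
        f"        <{old_pred}> ?value .\n"
        "}"
    )


# Hypothetical IRIs for the author_name -> creator_name rename.
query = rename_construct(
    "https://example.org/schema/Article",
    "https://example.org/schema/author_name",
    "https://example.org/schema/creator_name",
)
print(query)
```

In standard SPARQL, CONSTRUCT only returns the new triples. Removing the old predicate from the store would take a separate SPARQL Update step, which this sketch does not cover.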

CLI surface:

# Generate transformation plan (dry-run, default)
trails onto transform --source models/v1.py --target models/v2.py

# Generate and show the SPARQL / Python code
trails onto transform --source old.ttl --target new.ttl --show-code

# Execute the transformation
trails onto transform --source models/v1.py --target models/v2.py --execute

# Use LLM for ambiguous field mappings
trails onto transform --source old.ttl --target new.ttl --llm-assist

# Mark unmappable target fields as placeholders
trails onto transform --source old.ttl --target new.ttl --enrich

Python API:

from trails.onto_transform import TransformPlan, generate_plan, execute_plan

plan: TransformPlan = generate_plan(
    source_schema="models/v1.py",
    target_schema="models/v2.py",
    llm_assist=True,
    ctx=ctx,
)

# Inspect before executing
for mapping in plan.field_mappings:
    print(f"{mapping.source} → {mapping.target} ({mapping.strategy})")

for placeholder in plan.placeholders:
    print(f"  [placeholder] {placeholder.field} — no source equivalent")

# Execute
result = execute_plan(plan, ctx)
print(f"Transformed {result.node_count} nodes, {result.triple_count} triples")

2. Placeholder fields and enrichment markers

@node_type gains a placeholder marker on field definitions, written as (type, placeholder), to declare "this field exists in the schema but is not yet populated."

Syntax:

from trails.orm import node_type, placeholder

@node_type("Article", fields={
    "title": str,
    "body": str,
    "sentiment": (str, placeholder),      # not yet filled
    "quality_score": (float, placeholder), # not yet filled
    "summary": (str, placeholder),         # not yet filled
})
class Article:
    """An article with enrichment placeholders."""
    pass

Semantics:

  • Placeholder fields are nullable by default — writes that omit them succeed without validation errors.
  • SHACL export emits sh:minCount 0 for placeholder fields regardless of the type's normal cardinality rules. A str placeholder does not generate sh:minCount 1.
  • The placeholder marker is metadata, not a type. The field's actual type (str, float, etc.) is preserved for validation when a value IS written.
  • trails doctor reports placeholder fill rates:
    Article:
      sentiment:     23% filled (115 / 500)
      quality_score:  0% filled (0 / 500)
      summary:       89% filled (445 / 500)
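The SHACL rule above can be illustrated with a small sketch that renders one property shape per field. The property_shape helper, the XSD type table, and the ex: prefix are assumptions for the example, not the exporter's actual code. The only behaviour taken from the semantics above is that a placeholder always gets sh:minCount 0, while a plain str field gets sh:minCount 1.

```python
# Hypothetical mapping from Python field types to XSD datatypes.
XSD = {str: "xsd:string", float: "xsd:double", int: "xsd:integer", bool: "xsd:boolean"}


def property_shape(field: str, py_type: type, is_placeholder: bool) -> str:
    """Render a SHACL property shape (Turtle fragment) for one field."""
    # Placeholders are always optional; the declared datatype is still kept,
    # so a value is validated whenever one is written.
    min_count = 0 if is_placeholder else 1
    return (
        "[\n"
        f"    sh:path ex:{field} ;\n"
        f"    sh:datatype {XSD[py_type]} ;\n"
        f"    sh:minCount {min_count} ;\n"
        "]"
    )


print(property_shape("title", str, is_placeholder=False))
print(property_shape("sentiment", str, is_placeholder=True))
```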
    

Integration with trails onto transform:

When --enrich is passed, target fields that have no source equivalent are automatically marked as placeholders in the generated code. This bridges transformation and enrichment: transform what you can, mark what you can't, enrich later.

3. Enrichment pipelines

A decorator-based registration system for functions that fill placeholder fields.

Syntax:

from trails.enrichment import enrichment

@enrichment(target_type="Article", field="sentiment")
def analyse_sentiment(node, ctx):
    """Fill the sentiment field using LLM analysis."""
    response = ctx.llm.complete(
        f"Classify the sentiment of this text as positive, negative, "
        f"or neutral: {node.body[:500]}",
        max_tokens=10,
    )
    return response.text.strip().lower()

@enrichment(target_type="Article", field="quality_score")
def compute_quality(node, ctx):
    """Derive quality score from existing fields."""
    score = 0.0
    if node.title and len(node.title) > 10:
        score += 0.3
    if node.body and len(node.body) > 200:
        score += 0.5
    if node.sentiment:
        score += 0.2
    return round(score, 2)

@enrichment(target_type="Article", field="summary")
def summarise(node, ctx):
    """Generate a summary using LLM."""
    response = ctx.llm.complete(
        f"Summarise this article in one sentence:\n\n{node.body[:2000]}",
        max_tokens=100,
    )
    return response.text.strip()

Execution:

# Run all registered enrichments on unfilled placeholders
trails enrich run

# Run enrichments for a specific type
trails enrich run --type Article

# Run a specific field enrichment
trails enrich run --type Article --field sentiment

# Show fill rates per field
trails enrich status

# Dry-run: show what would be enriched without executing
trails enrich run --dry-run

Semantics:

  • trails enrich run iterates over all instances of the target type, checks which placeholder fields are unfilled (None or absent), and calls the registered enrichment function for each.
  • Enrichment functions receive the node (ORM instance per ADR-0017) and a context with ctx.llm, ctx.kg, and ctx.cost.
  • Cost-tracked. Each trails enrich run invocation opens a CostTracker envelope tagged enrich:<type>:<field> per ADR-0012. The CLI reports total cost at completion.
  • Batch-aware. Enrichment functions that call ctx.llm benefit from the Batches API (50% discount) when --batch is passed. Functions are grouped by field, and prompts are batched.
  • PROV-O provenance. Each enriched value emits a prov:Activity linking the enrichment function (as prov:Agent) to the filled value (as prov:Entity). The activity records the model used, cost, and timestamp. trails trace can answer "who filled this field and when?"
  • Idempotent. Running trails enrich run twice skips already-filled fields. Re-enrichment requires --force.
  • Order-aware. Enrichment functions can declare dependencies: @enrichment(..., depends_on=["sentiment"]) ensures sentiment is filled before quality_score runs. The scheduler resolves the dependency graph and executes in topological order.
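The idempotent and order-aware rules above can be sketched with the standard library's graphlib. This is an illustration under simplifying assumptions: nodes are plain dicts, fillers take only the node, and run_enrichments is a hypothetical name rather than the scheduler's real entry point.

```python
from graphlib import TopologicalSorter


def run_enrichments(node: dict, fillers: dict, depends_on: dict, force: bool = False) -> list:
    """Fill unfilled fields of one node in dependency order; return fields written."""
    # TopologicalSorter expects {field: set of prerequisite fields}.
    graph = {field: set(depends_on.get(field, ())) for field in fillers}
    written = []
    for field in TopologicalSorter(graph).static_order():
        if field not in fillers:
            continue  # prerequisite that has no registered filler
        if node.get(field) is not None and not force:
            continue  # idempotent: already filled, skip unless forced
        node[field] = fillers[field](node)
        written.append(field)
    return written


fillers = {
    "quality_score": lambda n: round(0.5 + (0.2 if n.get("sentiment") else 0.0), 2),
    "sentiment": lambda n: "neutral",
}
depends_on = {"quality_score": ["sentiment"]}

article = {"title": "Hello", "sentiment": None, "quality_score": None}
first = run_enrichments(article, fillers, depends_on)
second = run_enrichments(article, fillers, depends_on)
print(first, second)
```

The second call writes nothing because both fields are already filled; passing force=True would recompute them. A cyclic depends_on declaration makes static_order raise graphlib.CycleError, which is a natural place to report a configuration error.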

4. Agent-driven enrichment (advanced)

For fully autonomous enrichment, trails enrich agent spawns an agent that discovers unfilled placeholders and attempts to fill them without pre-registered enrichment functions.

Process (ReAct planner from M9):

  1. Observe. The agent queries the store for all placeholder fields with fill rates below 100%. It ranks fields by estimated difficulty: type coercion (easy) → lookup from existing data (medium) → LLM inference (hard).

  2. Plan. For each unfilled field, the agent decides a strategy:

     • Derive: compute the value from other fields on the same node (e.g., full_name from first_name + last_name).
     • Lookup: query external APIs or reference data.
     • Infer: use LLM to generate a plausible value from context.
     • Skip: mark as "cannot fill" with a reason.

  3. Act. The agent executes the strategy, fills the value, and records provenance.

  4. Verify. The agent checks the filled value against the field's type constraint. Invalid values are rolled back with a warning.
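The verify step can be sketched as a write followed by a type check and a rollback. The helper below is a hypothetical illustration on a plain dict node; the agent's real check would run against the field's declared constraint in the store, and may be less strict than this one.

```python
import warnings


def fill_with_verification(node: dict, field: str, value, expected_type: type) -> bool:
    """Write value to node[field]; restore the old value if the type is wrong."""
    previous = node.get(field)
    node[field] = value
    # bool is a subclass of int in Python, so reject it for non-bool fields.
    wrong_type = not isinstance(value, expected_type) or (
        isinstance(value, bool) and expected_type is not bool
    )
    if wrong_type:
        node[field] = previous  # roll back
        warnings.warn(
            f"rolled back {field}: expected {expected_type.__name__}, "
            f"got {type(value).__name__}"
        )
        return False
    return True


article = {"quality_score": None}
ok = fill_with_verification(article, "quality_score", 0.8, float)
bad = fill_with_verification(article, "quality_score", "high", float)
print(ok, bad, article)
```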

CLI surface:

# Spawn enrichment agent
trails enrich agent

# With budget constraint (stops when exhausted)
trails enrich agent --budget 5.00

# Target specific types
trails enrich agent --type Article --type Author

# Progressive mode: easy fields first, then hard
trails enrich agent --progressive

Budget enforcement: the agent operates within a cost envelope per ADR-0012. When the budget is exhausted, it stops, reports what was filled, and what remains. The --progressive flag ensures cheap operations (type coercion, string manipulation) execute before expensive ones (LLM calls), maximising fill rate per dollar.
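A rough sketch of how budget enforcement and --progressive might interact: tasks are optionally sorted by estimated cost, and the loop stops at the first task that would exceed the budget. Task, estimated_cost, and run_within_budget are hypothetical names for this example; the real agent tracks spend through the ADR-0012 cost envelope rather than static estimates.

```python
from dataclasses import dataclass


@dataclass
class Task:
    field: str
    strategy: str          # "derive", "lookup", or "infer"
    estimated_cost: float  # hypothetical per-task estimate in dollars


def run_within_budget(tasks, budget, progressive=True):
    """Return (filled fields, remaining fields, amount spent)."""
    if progressive:
        # Cheapest work first, so more fields are filled per dollar.
        tasks = sorted(tasks, key=lambda t: t.estimated_cost)
    spent, filled, remaining = 0.0, [], []
    for i, task in enumerate(tasks):
        if spent + task.estimated_cost > budget:
            # Budget exhausted: stop and report everything still open.
            remaining = [t.field for t in tasks[i:]]
            break
        spent += task.estimated_cost
        filled.append(task.field)
    return filled, remaining, spent


tasks = [
    Task("summary", "infer", 0.04),
    Task("full_name", "derive", 0.0),
    Task("sentiment", "infer", 0.02),
]
print(run_within_budget(tasks, budget=0.03, progressive=True))
print(run_within_budget(tasks, budget=0.03, progressive=False))
```

With the same budget, the progressive run fills two fields, while the unsorted run stops at the first (expensive) task and fills none.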

Provenance: every agent-filled value links to a prov:Activity with trails:activityKind "trails.enrich.agent" and the agent's session ID. The full ReAct trace (observations, plans, actions) is recorded in the session log per M9's session persistence.
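As an illustration of the provenance links described above, the sketch below renders a Turtle fragment using standard PROV-O terms (prov:wasGeneratedBy, prov:wasAssociatedWith, prov:endedAtTime). The provenance_turtle helper and the example.org IRIs are assumptions, prefix declarations are left out, and the record Trails actually writes may carry more detail (model, cost, session ID).

```python
from datetime import datetime, timezone


def provenance_turtle(value_iri, activity_iri, agent_iri, activity_kind, ended_at):
    """Render PROV-O triples linking a filled value to its enrichment activity."""
    ts = ended_at.isoformat()
    return (
        f"<{value_iri}> a prov:Entity ;\n"
        f"    prov:wasGeneratedBy <{activity_iri}> .\n"
        f"<{activity_iri}> a prov:Activity ;\n"
        f"    prov:wasAssociatedWith <{agent_iri}> ;\n"
        f'    prov:endedAtTime "{ts}"^^xsd:dateTime ;\n'
        f'    trails:activityKind "{activity_kind}" .\n'
        f"<{agent_iri}> a prov:Agent .\n"
    )


# Hypothetical IRIs; only the activityKind string comes from this ADR.
ttl = provenance_turtle(
    "https://example.org/article/1#sentiment",
    "https://example.org/activity/42",
    "https://example.org/agent/enrich",
    "trails.enrich.agent",
    datetime(2026, 4, 19, 12, 0, tzinfo=timezone.utc),
)
print(ttl)
```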

Progressive enhancement integration

The four mechanisms map onto ADR-0021's additive model:

| Level | Mechanism | What it adds |
|---|---|---|
| 0 | Manual SPARQL CONSTRUCT | Status quo: hand-written transformation queries |
| 1 | trails onto transform | Auto-generated field mappings with dry-run |
| 2 | Placeholder fields + @enrichment | Typed gaps in the schema + registered fillers |
| 3 | trails enrich run | Batch execution of enrichment pipelines |
| 4 | trails enrich agent | Autonomous agent fills gaps using ReAct planning |

Each level is additive. A project that uses Level 1 never sees enrichment. A project that uses Level 3 may never need Level 4. Users who write manual SPARQL CONSTRUCT today keep doing so — nothing changes for them.

Non-goals

  • Full ETL framework. Trails is not Airflow. No job scheduling, no streaming, no workflow-level DAG orchestration; the depends_on ordering in §3 applies only within a single run. trails enrich run is a one-shot batch command; recurring execution is the caller's responsibility (cron, CI pipeline, external scheduler).
  • Schema versioning system. Schema versions are tracked by git + trails onto evolve (M4). This ADR does not add a version registry, migration history table, or rollback mechanism beyond what git provides.
  • Replacing RML. RML (ADR-0024) maps external sources into the KG. trails onto transform maps one KG schema to another KG schema. Different problems, complementary solutions. A typical pipeline is: RML to ingest, transform to reshape, enrich to fill gaps.
  • Real-time enrichment. Enrichment is batch, not streaming. A write-time enrichment hook (enrich on insert) is out of scope; it introduces latency, error-handling complexity, and cost unpredictability that batch avoids.
  • Unsupervised schema changes. trails onto transform --execute modifies the store, but the transformation plan is always generated first and shown to the user. No silent rewrites.

Consequences

Positive

  • Schema migration becomes declarative: source + target in, plan out. The manual SPARQL CONSTRUCT grind is replaced by auto-generated code that the user reviews and executes.
  • Placeholder fields make "not yet populated" a first-class concept. No more null ambiguity — a placeholder explicitly declares intent to enrich.
  • Enrichment pipelines connect the schema layer to the agent runtime. The same agent that queries the KG can fill gaps in it, with cost tracking and provenance.
  • Progressive enhancement works: a project starts with manual SPARQL, adds trails onto transform when schemas multiply, adds enrichment when agents are ready. No mode switches.
  • PROV-O coverage extends to enrichment: every filled value has a provenance chain. trails trace answers "where did this value come from?" whether the value was ingested, transformed, or enriched.

Negative

  • New CLI surface: trails onto transform, trails enrich run/status/agent. More commands to document and maintain.
  • LLM-assisted mapping and agent-driven enrichment are non-deterministic. Two runs may produce different mappings or different filled values. Mitigation: dry-run by default, provenance on every write, deterministic strategies preferred over LLM when possible.
  • Placeholder fields add a concept to the ORM surface. Users must learn (type, placeholder) syntax. Mitigation: optional — users who never use placeholders never see them.
  • Enrichment ordering (dependency graph) adds complexity. Mitigation: order is resolved automatically; users only declare depends_on when needed.

Neutral

  • Storage model unchanged. Transformed and enriched triples live in the same Oxigraph store as everything else. No new named-graph conventions.
  • The ORM surface (@node_type) is extended, not replaced. Existing @node_type declarations without placeholders work identically.
  • Cost model unchanged. Enrichment costs flow through the existing CostTracker (ADR-0012). No new billing surface.

Relationship to other ADRs

| ADR | Relationship |
|---|---|
| ADR-0012 (Cost as framework primitive) | Extended: enrichment runs open cost envelopes; agent enrichment respects budget constraints. |
| ADR-0017 (ActiveGraph ORM) | Extended: @node_type gains placeholder marker; enrichment functions receive ORM instances. |
| ADR-0021 (Progressive enhancement) | Aligned: transformation and enrichment are additive layers. No mode switches. |
| ADR-0024 (RML declarative mapping) | Complementary: RML maps external sources → KG; transform maps KG schema → KG schema. Pipeline: ingest via RML, reshape via transform, fill via enrich. |
| ADR-0025 (Auto-ontology generation) | Complementary: onto infer discovers source schema, onto generate designs target schema, onto transform migrates between them. |
| ADR-0009 (Provenance always on) | Extended: enrichment activities use PROV-O. New activityKind values: trails.onto_transform.execute, trails.enrich.run, trails.enrich.agent. |
| ADR-0018 (Agent runtime) | Extended: trails enrich agent uses the ReAct planner and session persistence from M9. |

Alternatives considered

  1. Extend RML to handle KG-to-KG transformation. Rejected. RML's data model assumes external sources (CSV, JSON, XML, SQL) with iterators and reference formulations. KG-to-KG transformation is a different problem: the data is already triples, the mapping is schema-to-schema, and the operations (rename, coerce, split, merge) don't map to RML's source/iterator/subject-map model. Forcing RML into this role would require non-standard extensions that break compatibility with Morph-KGC.

  2. SPARQL CONSTRUCT as the only transformation mechanism. Rejected. SPARQL CONSTRUCT is powerful but requires SPARQL fluency. Auto-generating CONSTRUCT queries from schema diffs gives users the same power without the expertise requirement. The generated CONSTRUCT is still visible (--show-code) and editable, so no abstraction is lost.

  3. External ETL tool integration (dbt, Airbyte, etc.). Rejected for the transformation layer. External tools solve extract-load, not schema-to-schema transformation within a KG. Enrichment could theoretically delegate to external tools, but the cost-tracking and provenance requirements make in-framework enrichment more traceable.

  4. Write-time enrichment hooks (enrich on insert). Rejected for v1. Real-time enrichment adds latency to every write, makes error handling complex (what if the LLM is down?), and makes cost unpredictable. Batch enrichment via trails enrich run is simpler, cheaper (Batches API eligibility), and debuggable. Write-time hooks may be revisited in a future ADR when real-time requirements materialise.

  5. Schema diffing only (no enrichment). Rejected. Schema transformation without enrichment leaves unmappable fields as dead ends. The placeholder + enrichment pipeline turns dead ends into actionable work items that agents can execute. The two features are designed together because they solve the same workflow: reshape data, then fill the gaps.

Open questions

  1. Conflict resolution for overlapping enrichments. If two enrichment functions target the same field, which wins? Current design: first-registered wins; --force allows re-enrichment. Should there be a priority system or a "best of N" evaluator? Recommendation: first-registered for v1; priority system as a follow-on if real use cases demand it.

  2. Incremental transformation. Should trails onto transform support incremental mode (transform only nodes added since last run)? For large stores, full re-transformation is expensive. Recommendation: full transformation for v1; incremental as an optimisation when stores exceed ~100k nodes, using the provenance activity timestamps as the high-water mark.

  3. Enrichment function testing. How should users test enrichment functions in isolation? The ctx.llm dependency makes unit testing non-trivial. Recommendation: LLMClient.mock() (from M9) for unit tests; trails enrich run --dry-run for integration tests. Document the testing pattern in the enrichment guide.

  4. Cross-type enrichment. Should an enrichment function be able to read from type A to fill a field on type B (e.g., fill Article.author_bio by looking up Author.biography)? Recommendation: yes — the enrichment function receives ctx.kg and can query any type. No restriction on read scope; write scope is limited to the declared target type and field.
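For open question 3, one possible unit-testing pattern is to pass the enrichment function a hand-written stub in place of ctx.llm. The sketch below does not use LLMClient.mock(), whose interface is not described here; StubLLM is a hypothetical stand-in, and analyse_sentiment repeats the §3 example without its decorator so the snippet runs on its own.

```python
from types import SimpleNamespace


class StubLLM:
    """Stand-in for ctx.llm: records prompts and returns a canned reply."""

    def __init__(self, reply: str):
        self.reply = reply
        self.prompts = []

    def complete(self, prompt, max_tokens=None):
        self.prompts.append(prompt)
        return SimpleNamespace(text=self.reply)


def analyse_sentiment(node, ctx):
    """Same body as the §3 example, minus the decorator."""
    response = ctx.llm.complete(
        f"Classify the sentiment of this text as positive, negative, "
        f"or neutral: {node.body[:500]}",
        max_tokens=10,
    )
    return response.text.strip().lower()


ctx = SimpleNamespace(llm=StubLLM("  Positive\n"))
node = SimpleNamespace(body="A delightful read. " * 100)
result = analyse_sentiment(node, ctx)
print(result, len(ctx.llm.prompts))
```

Because the stub records prompts, a test can check both the normalised return value and that the body was truncated to 500 characters before being sent.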