ADR-0024: RML integration for declarative data mapping
- Status: Accepted (2026-04-19)
- Date: 2026-04-17
- Extends: ADR-0019 (App surface — ingestion slice)
- Relates to: ADR-0009 (Provenance always on), ADR-0021 (Progressive enhancement)
- Supersedes: —
- Superseded by: —
Context
trails.ingest (M10 Phase 1) provides code-driven extractors for
document formats: PDF, HTML, Markdown, DOCX, RTF, and plain text. Each
extractor is a Python function that returns str; the pipeline then
chunks, hashes, and loads the result as Document + Chunk nodes with
PROV-O provenance.
This works well for unstructured text, but structured and semi-structured data sources — CSV files, JSON APIs, relational databases, XML feeds — need custom extractor code today. Every new source shape requires a new Python function, a new MIME entry, new tests, and manual triple construction. The cost scales linearly with the number of source schemas an app consumes.
RML (RDF Mapping Language) is a W3C Community Group specification
that solves exactly this problem. RML extends R2RML (the W3C standard
for relational-to-RDF mapping) to support heterogeneous sources — CSV,
JSON (via JSONPath), XML (via XPath), and SQL databases — through
declarative mapping rules expressed as RDF (Turtle). A single
rml:TriplesMap declares the logical source, an iterator, a subject
template, and predicate-object maps that produce RDF triples. No
procedural code.
Key properties that make RML a natural fit for Trails:
- Declarative. Mappings are data, not code. They can be versioned, diffed, reviewed, and shared without touching Python.
- RDF-native. RML mappings are themselves RDF triples. They can live in the KG, be queried with SPARQL, and participate in provenance graphs.
- Mature tooling. Morph-KGC is an actively maintained, pip-installable Python RML processor that handles CSV, JSON, XML, and RDBMS sources out of the box.
- Progressive enhancement. Simple apps use code extractors and never see RML. Apps that outgrow code extractors add RML mappings incrementally — no mode switch, no migration.
Decision
1. trails.rml module
A new top-level module trails.rml wraps Morph-KGC as a thin
integration layer. Morph-KGC is an optional dependency installed
via:
The import is lazy: import trails.rml succeeds only when morph_kgc
is importable; otherwise it raises TrailsError with an actionable
install hint (same pattern as the ingest extractors).
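The guard described above can be sketched with the standard library alone. This is a minimal illustration, not the framework's code: `TrailsError` is redefined here as a stand-in, and `require_optional` and its `install_hint` parameter are hypothetical names.

```python
import importlib


class TrailsError(Exception):
    """Stand-in for the framework's error type (hypothetical definition)."""


def require_optional(module_name: str, install_hint: str):
    """Import an optional dependency or fail with an actionable message.

    `require_optional` and `install_hint` are illustrative names only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the real cause stays visible in tracebacks.
        raise TrailsError(
            f"{module_name} is required for this feature. {install_hint}"
        ) from exc
```

A module following this pattern would call `require_optional("morph_kgc", ...)` at its top, passing whatever install command the project documents as the hint.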
Core surface:
from trails.rml import run_mapping, validate_mapping
# Execute a mapping, load results into the KG with PROV-O provenance.
result = run_mapping("mapping.ttl", ctx)
print(f"Generated {result.triple_count} triples, activity: {result.activity_iri}")
# Validate without executing — checks syntax, source reachability, term maps.
issues = validate_mapping("mapping.ttl")
for issue in issues:
    print(f"{issue.severity}: {issue.message}")
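For orientation, a thin wrapper over Morph-KGC might look roughly like the sketch below. It relies on `morph_kgc.materialize`, which accepts an INI-style configuration string and returns an rdflib graph. The helper names `build_config` and `run_mapping_sketch` are hypothetical, and the real `run_mapping` would additionally load the triples into the store and emit provenance.

```python
def build_config(mapping_path: str, section: str = "DataSource1") -> str:
    """Build a minimal Morph-KGC INI configuration for one mapping file."""
    return f"[{section}]\nmappings: {mapping_path}\n"


def run_mapping_sketch(mapping_path: str):
    """Materialise a mapping and return (graph, triple_count).

    Hypothetical helper: store loading and provenance are omitted.
    """
    import morph_kgc  # optional dependency, imported only when actually used

    graph = morph_kgc.materialize(build_config(mapping_path))
    # An rdflib Graph reports its number of triples through len().
    return graph, len(graph)
```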
2. CLI commands
# Execute an RML mapping and load results into the local KG.
trails rml run mapping.ttl [--source-override key=path ...] [--graph <named-graph>]
# Validate an RML mapping without executing.
trails rml validate mapping.ttl
trails rml run writes the generated triples into the kernel store and
emits a prov:Activity (see §6). --source-override lets the caller
rebind logical source paths at invocation time without editing the
mapping file — useful for CI pipelines that stage data in temp dirs.
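One way the `--source-override key=path` pairs could be parsed is sketched below with `argparse`. This assumes the flag accepts several pairs and may be repeated, which the synopsis above suggests but does not spell out. `build_parser` and `parse_source_overrides` are hypothetical names.

```python
import argparse
from typing import Dict, Iterable


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the `trails rml run` synopsis (sketch only)."""
    parser = argparse.ArgumentParser(prog="trails-rml-run-sketch")
    parser.add_argument("mapping")
    # "extend" accumulates values across repeated uses of the flag.
    parser.add_argument("--source-override", action="extend", nargs="+",
                        default=[], metavar="KEY=PATH")
    parser.add_argument("--graph", default=None)
    return parser


def parse_source_overrides(pairs: Iterable[str]) -> Dict[str, str]:
    """Turn ["sales=/tmp/s.csv", ...] into {"sales": "/tmp/s.csv", ...}."""
    overrides: Dict[str, str] = {}
    for pair in pairs:
        key, sep, path = pair.partition("=")
        if not sep or not key or not path:
            raise ValueError(f"expected key=path, got {pair!r}")
        if key in overrides:
            raise ValueError(f"source {key!r} overridden more than once")
        overrides[key] = path
    return overrides
```

Rejecting duplicate keys is a design choice of this sketch; letting the last value win would be equally defensible.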
3. Integration with trails.ingest
IngestPipeline (currently ingest_file / ingest_directory) gains
an rml_mapping= parameter:
from trails.ingest import ingest_file
# Code extractor (unchanged — progressive enhancement).
doc_iri = ingest_file("paper.pdf", ctx)
# RML-driven ingestion for structured data.
doc_iri = ingest_file(
    "transactions.csv",
    ctx,
    rml_mapping="mappings/transactions.ttl",
)
When rml_mapping= is set, the pipeline delegates to trails.rml
instead of the code extractors. The chunker step is skipped — RML
produces typed triples directly, not flat text that needs splitting.
PROV-O provenance is emitted by the RML runner, not the ingest
pipeline, to avoid double-counting.
4. @rml_source decorator
Register a reusable data source that RML mappings can reference by logical name:
from trails.rml import rml_source
@rml_source("sales_csv", kind="csv")
def sales_data():
    return "/data/exports/sales_2026.csv"

@rml_source("product_api", kind="json")
def product_feed():
    return "https://api.example.com/products"
The decorator registers the source in an in-process registry keyed by
name. run_mapping resolves rml:logicalSource references against
this registry so mappings can use stable logical names instead of
hardcoded paths. Sources registered via @rml_source also appear in
trails rml validate output (reachability check).
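The in-process registry behind the decorator could be as small as the following sketch. It is an illustration under assumptions, not the shipped implementation: `rml_source_sketch`, `resolve_source`, and the `_REGISTRY` dictionary are hypothetical names, and the sketch resolves a source by calling the registered function at lookup time.

```python
from typing import Callable, Dict, NamedTuple, Tuple


class RegisteredSource(NamedTuple):
    kind: str                    # e.g. "csv", "json", "xml"
    resolver: Callable[[], str]  # returns the concrete path or URL when called


_REGISTRY: Dict[str, RegisteredSource] = {}


def rml_source_sketch(name: str, *, kind: str):
    """Register the decorated function as the resolver for `name`."""
    def decorator(fn: Callable[[], str]) -> Callable[[], str]:
        if name in _REGISTRY:
            raise ValueError(f"source {name!r} is already registered")
        _REGISTRY[name] = RegisteredSource(kind, fn)
        return fn  # the function itself is left unchanged
    return decorator


def resolve_source(name: str) -> Tuple[str, str]:
    """Return (kind, location) for a logical source name."""
    try:
        entry = _REGISTRY[name]
    except KeyError:
        raise KeyError(f"no source registered under {name!r}") from None
    # Resolving lazily lets the location depend on the runtime environment.
    return entry.kind, entry.resolver()
```

Deferring the call to the resolver until lookup is what would allow the same logical name to point at different locations in dev and prod.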
5. Supported source types
Phase 1 (this ADR): CSV, JSON, XML files — the three non-SQL
source types Morph-KGC supports natively via rml:referenceFormulation.
Phase 2 (follow-on work): RDBMS via SQLAlchemy
(trails.rml.sql_source), REST APIs via trails.rml.http_source with
built-in pagination and retry. Phase 2 is out of scope for this ADR but
the module structure is designed to accommodate it without breaking
changes.
Phase 3 (future ADR): Streaming sources (Kafka, MQTT). Requires a different execution model (continuous rather than batch); will be proposed separately when a real use case materialises.
6. Provenance
Every run_mapping execution emits PROV-O triples in the framework
provenance graph (https://trails.dev/ns/prov/), consistent with
ADR-0009:
<activity/rml/019…> a prov:Activity ;
    prov:startedAtTime "2026-04-17T14:30:00Z"^^xsd:dateTime ;
    prov:endedAtTime "2026-04-17T14:30:02Z"^^xsd:dateTime ;
    prov:used <mapping.ttl> ;              # the mapping file
    prov:used <file:///data/sales.csv> ;   # the logical source(s)
    trails:activityKind "trails.rml.run_mapping" ;
    trails:tripleCount 4287 .
# Every generated triple's subject is linked as prov:generated.
<activity/rml/019…> prov:generated <iri1>, <iri2>, … .
The mapping file itself is asserted as a prov:Entity so downstream
trace tooling (trails trace) can answer "which mapping produced this
triple?" without scraping activity IRIs. When the mapping file lives
in the KG (mappings are themselves RDF, as noted under Context), the
provenance chain is fully internal.
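As a rough illustration of the activity record above, the sketch below assembles the same statements as N-Triples lines using only the standard library. The `TRAILS` namespace IRI is an assumption (the ADR names only the provenance graph IRI), and `activity_ntriples` is a hypothetical helper; all IRIs passed in must be absolute.

```python
from datetime import datetime, timezone
from typing import Iterable, List

PROV = "http://www.w3.org/ns/prov#"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TRAILS = "https://trails.dev/ns/"  # assumed namespace for the trails: prefix


def _date_time(moment: datetime) -> str:
    """Format a timezone-aware datetime as an xsd:dateTime literal in UTC."""
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'"{stamp}"^^<{XSD}dateTime>'


def activity_ntriples(activity: str, mapping: str, sources: Iterable[str],
                      started: datetime, ended: datetime,
                      subjects: Iterable[str], triple_count: int) -> List[str]:
    """Return N-Triples lines describing one mapping run (sketch only)."""
    a = f"<{activity}>"
    lines = [
        f"{a} <{RDF_TYPE}> <{PROV}Activity> .",
        f"{a} <{PROV}startedAtTime> {_date_time(started)} .",
        f"{a} <{PROV}endedAtTime> {_date_time(ended)} .",
        f"{a} <{PROV}used> <{mapping}> .",
        # The mapping file is asserted as an entity in its own right.
        f"<{mapping}> <{RDF_TYPE}> <{PROV}Entity> .",
    ]
    lines += [f"{a} <{PROV}used> <{source}> ." for source in sources]
    lines.append(f'{a} <{TRAILS}activityKind> "trails.rml.run_mapping" .')
    lines.append(f'{a} <{TRAILS}tripleCount> "{triple_count}"^^<{XSD}integer> .')
    # One prov:generated link per distinct subject, in a stable order.
    lines += [f"{a} <{PROV}generated> <{s}> ." for s in sorted(set(subjects))]
    return lines
```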
7. Progressive enhancement
RML integration follows the north-star ADR-0021:
| Stage | Surface | RML involvement |
|---|---|---|
| Simple ingest | ingest_file("paper.pdf", ctx) | None. Code extractors handle it. |
| Structured data | ingest_file("data.csv", ctx, rml_mapping="m.ttl") | User writes a Turtle mapping. |
| Schema inference | trails rml generate --from data.csv | Auto-generated RML from CSV headers / JSON schema. Ties into the auto-ontology work (future ADR). |
No user is required to learn RML. Users who need it get it without a
mode switch: the trails.ingest pipeline, the provenance graph, and the
trails trace tooling all stay the same.
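To make the schema-inference stage concrete, the sketch below derives a mapping skeleton from CSV headers. It is not the planned `trails rml generate` implementation. It emits the older RML vocabulary (the `rr:`, `rml:` and `ql:` prefixes), whereas the ADR's `rml:TriplesMap` points at the newer unified namespace, so treat the vocabulary as one possible target. The `ex:` namespace, the `<#RowMap>` name, and `mapping_from_headers` are all hypothetical.

```python
import re
from typing import Sequence

PREFIXES = """@base <http://example.com/mapping/> .
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix ex: <http://example.com/ns#> .
"""


def mapping_from_headers(headers: Sequence[str], source_path: str,
                         key_column: str) -> str:
    """Emit a Turtle mapping skeleton: one predicate per non-key column.

    Sketch only: assumes simple column names and an unquoted source path.
    """
    if key_column not in headers:
        raise ValueError(f"key column {key_column!r} not in headers")
    blocks = [
        "    rml:logicalSource [\n"
        f'        rml:source "{source_path}" ;\n'
        "        rml:referenceFormulation ql:CSV\n"
        "    ]",
        "    rr:subjectMap [\n"
        f'        rr:template "http://example.com/row/{{{key_column}}}" ;\n'
        "        rr:class ex:Row\n"
        "    ]",
    ]
    for column in headers:
        if column == key_column:
            continue
        # Collapse characters that are awkward in a prefixed name.
        local = re.sub(r"\W+", "_", column).strip("_")
        blocks.append(
            "    rr:predicateObjectMap [\n"
            f"        rr:predicate ex:{local} ;\n"
            f'        rr:objectMap [ rml:reference "{column}" ]\n'
            "    ]"
        )
    return PREFIXES + "\n<#RowMap> a rr:TriplesMap ;\n" + " ;\n".join(blocks) + " .\n"
```

The output stops at a class plus literal-valued properties, so a user would still edit datatypes and IRIs by hand.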
Non-goals
- Reimplementing an RML processor. Morph-KGC exists, is actively maintained, and handles the spec. Trails wraps it; it does not compete with it.
- SPARQL-based data sources in Phase 1. That is federation (ADR-0008 / MCP transport territory), not mapping. A future ADR may bridge RML's rml:query with the federation layer.
- R2RML-only support. Trails targets full RML with logical sources, not the R2RML subset restricted to SQL. R2RML mappings are valid RML and will work, but the surface is RML-first.
- GUI mapping editor. Out of scope. Third-party tools (RMLEditor, YARRRML) can produce the Turtle that trails rml run consumes.
Consequences
Positive
- Structured data sources (CSV, JSON, XML) become first-class without writing Python extractors. A Turtle mapping file is all that's needed.
- Mappings are versionable, diffable, auditable artifacts — better traceability than procedural code for data-pipeline compliance.
- PROV-O provenance covers the full chain: which mapping, which source, when, how many triples. trails trace works unchanged.
- The @rml_source registry decouples mapping logic from deployment paths: the same mapping runs in dev (local CSV) and prod (S3 export) by rebinding the source.
- RML mappings stored in the KG are queryable: "show me all mappings that reference the sales source" is a SPARQL query.
Negative
- New optional dependency (morph-kgc). Mitigation: lazy import, not in the default install. Users who never touch RML pay nothing.
- RML has a learning curve for users unfamiliar with RDF mapping languages. Mitigation: the trails rml generate helper (Phase 2) bootstraps a mapping from a CSV/JSON schema; users edit rather than write from scratch.
- Morph-KGC's output is an in-memory RDF graph (rdflib). Loading it into Oxigraph requires a serialisation round-trip (N-Triples → parse). Mitigation: N-Triples is the fastest serialisation format for bulk load; the overhead is negligible for datasets under 1M triples and acceptable up to ~10M.
Neutral
- Storage model unchanged. RML-generated triples live in the same Oxigraph store as everything else. No new named-graph convention beyond the existing provenance graph.
- The code-extractor path is unaffected. ingest_file("paper.pdf", ctx) continues to work exactly as today.
Relationship to other ADRs
| ADR | Impact |
|---|---|
| ADR-0009 (Provenance always on) | Extended: RML activities use the same PROV-O pattern as ingest activities. New activityKind value trails.rml.run_mapping. |
| ADR-0017 (ActiveGraph ORM) | Unchanged. RML-generated triples are raw triples, not ORM nodes. Users who want ORM access define @node_type classes whose IRIs match the mapping output — additive, per ADR-0021. |
| ADR-0019 (App surface) | Extended: RML is a new ingestion pathway alongside code extractors. |
| ADR-0021 (Progressive enhancement) | Aligned. RML is an additive feature. No mode switch. |
| ADR-0022 (Cedar unified matcher) | Unchanged. Cedar policies match on the triples' types regardless of how they were produced. |
Alternatives considered
- Custom trails.ingest extractors for CSV / JSON / XML. Rejected. Each new source schema needs a new function, new tests, and manual triple construction. Code cost scales linearly with the number of schemas; with RML, each new schema costs one mapping file instead.
- SPARQL CONSTRUCT from a staging graph. Load raw data as untyped triples, then CONSTRUCT the target shape. Rejected: requires two-phase ingest, staging graph lifecycle management, and SPARQL fluency for what should be a mapping concern.
- Use YARRRML (YAML-based RML syntax) as the primary surface. Rejected for Phase 1. YARRRML is friendlier but adds a transpilation step and another spec to track. The Turtle-native surface is closer to the KG substrate. YARRRML support can be added as a convenience layer in Phase 2 (parse YAML → emit Turtle → hand to run_mapping).
- Direct rdflib integration without Morph-KGC. Rejected. Building an RML processor from scratch contradicts the non-goal. Morph-KGC is W3C-test-suite-compliant and actively maintained.
Open questions
- Batch size for large mappings. Morph-KGC materialises the full output graph in memory. For very large sources (>10M rows) this may require chunked execution (partition the source, run N sub-mappings, merge). Should the chunking be automatic or caller-controlled? Recommendation: caller-controlled in Phase 1 (pass --chunk-size to trails rml run); automatic partitioning as a Phase 2 optimisation.
- Named-graph placement. Should RML output go into the default graph or a mapping-specific named graph? The provenance activity already records the mapping origin, so a separate named graph is redundant for lineage. But named graphs simplify "delete everything this mapping produced" for re-runs. Recommendation: default graph, with an opt-in --graph flag on trails rml run.
- Mapping-file storage. Should mappings be stored in the KG (as prov:Entity nodes with the Turtle serialised in a literal) or on the filesystem? Recommendation: filesystem-first (mappings live in the project repo, versioned by git). KG storage as an opt-in for apps that need runtime-editable mappings.
- trails rml generate scope. The auto-generation helper is listed under Phase 2 / progressive enhancement. Should it infer OWL classes from CSV headers, or stop at rdf:type + literal properties? Recommendation: start with rdf:type + literals; OWL inference belongs in the auto-ontology ADR.
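The caller-controlled chunking recommended in the first open question could partition a CSV source along these lines. This is a sketch under assumptions: `iter_csv_chunks` is a hypothetical helper, and it only splits the rows. Running a sub-mapping per chunk and merging the results is left out.

```python
import csv
from itertools import islice
from typing import Iterator, List, TextIO, Tuple


def iter_csv_chunks(handle: TextIO,
                    chunk_size: int) -> Iterator[Tuple[List[str], List[List[str]]]]:
    """Yield (header, rows) pairs holding at most `chunk_size` rows each.

    The header is repeated with every chunk so that each chunk can be
    written out as a standalone CSV file for one sub-mapping run.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    while True:
        rows = list(islice(reader, chunk_size))
        if not rows:
            return
        yield header, rows
```

Because the reader is consumed lazily, only one chunk of rows is held in memory at a time, which is the point of partitioning before handing data to a processor that materialises its whole output.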