ADR-0024: RML integration for declarative data mapping

Status: Accepted (2026-04-19)
Date: 2026-04-17
Extends: ADR-0019 (App surface — ingestion slice)
Relates to: ADR-0009 (Provenance always on), ADR-0021 (Progressive enhancement)
Supersedes: —
Superseded by: —

Context

trails.ingest (M10 Phase 1) provides code-driven extractors for document formats: PDF, HTML, Markdown, DOCX, RTF, and plain text. Each extractor is a Python function that returns str; the pipeline then chunks, hashes, and loads the result as Document + Chunk nodes with PROV-O provenance.

This works well for unstructured text, but structured and semi-structured data sources — CSV files, JSON APIs, relational databases, XML feeds — need custom extractor code today. Every new source shape requires a new Python function, a new MIME entry, new tests, and manual triple construction. The cost scales linearly with the number of source schemas an app consumes.

RML (RDF Mapping Language) is a W3C Community Group specification that solves exactly this problem. RML extends R2RML (the W3C standard for relational-to-RDF mapping) to support heterogeneous sources — CSV, JSON (via JSONPath), XML (via XPath), and SQL databases — through declarative mapping rules expressed as RDF (Turtle). A single rml:TriplesMap declares the logical source, an iterator, a subject template, and predicate-object maps that produce RDF triples. No procedural code.
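
To make the moving parts of a triples map concrete, here is a minimal Python sketch of what a processor conceptually does with one: iterate the source, expand the subject template per record, and emit one triple per predicate-object map. This is not RML syntax and not how Morph-KGC is implemented; the `TRIPLES_MAP` dictionary, the `apply_triples_map` helper, and the example.com IRIs are hypothetical names chosen for illustration.

```python
import csv
import io

# Hypothetical stand-in for one triples map (illustration only, not RML syntax).
TRIPLES_MAP = {
    "subject_template": "http://example.com/transaction/{id}",
    # Each entry pairs a predicate IRI with the source column it reads.
    "predicate_object_maps": [
        ("http://example.com/ns#amount", "amount"),
        ("http://example.com/ns#currency", "currency"),
    ],
}


def apply_triples_map(source, triples_map):
    """Yield (subject, predicate, object) tuples for every record in a CSV source."""
    # For CSV the iterator is implicit: one iteration per row.
    for row in csv.DictReader(source):
        subject = triples_map["subject_template"].format(**row)
        for predicate, column in triples_map["predicate_object_maps"]:
            yield (subject, predicate, row[column])


SAMPLE_CSV = "id,amount,currency\n1,9.50,EUR\n2,12.00,USD\n"
triples = list(apply_triples_map(io.StringIO(SAMPLE_CSV), TRIPLES_MAP))
```

In a real mapping the dictionary above is replaced by RDF statements in a Turtle file, which is what lets the mapping be versioned and queried like any other data.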

Key properties that make RML a natural fit for Trails:

Declarative. Mappings are data, not code. They can be versioned, diffed, reviewed, and shared without touching Python.
RDF-native. RML mappings are themselves RDF triples. They can live in the KG, be queried with SPARQL, and participate in provenance graphs.
Mature tooling. Morph-KGC is an actively maintained, pip-installable Python RML processor that handles CSV, JSON, XML, and RDBMS sources out of the box.
Progressive enhancement. Simple apps use code extractors and never see RML. Apps that outgrow code extractors add RML mappings incrementally — no mode switch, no migration.

Decision

1. `trails.rml` module

A new top-level module trails.rml wraps Morph-KGC as a thin integration layer. Morph-KGC is an optional dependency installed via:

pip install 'trails[rml]'

The import is lazy: import trails never imports morph_kgc, so the base install is unaffected. import trails.rml succeeds only when morph_kgc is importable; otherwise it raises TrailsError with an actionable install hint (same pattern as the ingest extractors).
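
One way the guarded import could look is sketched below. The `TrailsError` class here is a local stand-in for the framework's error type, and `require_optional` is a hypothetical helper name; neither is confirmed by this ADR.

```python
import importlib


class TrailsError(Exception):
    """Local stand-in for the framework's error type (assumed shape)."""


def require_optional(module_name, extra):
    """Import an optional dependency, or fail with an actionable install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise TrailsError(
            f"{module_name!r} is not installed; "
            f"run: pip install 'trails[{extra}]'"
        ) from exc
```

Under this sketch, the top of the `trails.rml` package would call `require_optional("morph_kgc", "rml")`, so the failure surfaces at import time with the install command in the message.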

Core surface:

from trails.rml import run_mapping, validate_mapping

# Execute a mapping, load results into the KG with PROV-O provenance.
result = run_mapping("mapping.ttl", ctx)
print(f"Generated {result.triple_count} triples, activity: {result.activity_iri}")

# Validate without executing — checks syntax, source reachability, term maps.
issues = validate_mapping("mapping.ttl")
for issue in issues:
    print(f"{issue.severity}: {issue.message}")
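
As a rough sketch of the wrapping involved, the core of run_mapping could hand Morph-KGC an INI-style configuration that points at the mapping file. `morph_kgc.materialize` and the `mappings` key are part of Morph-KGC's documented interface; the helper names `build_morph_config` and `materialize_mapping` are assumptions, and loading into the store and emitting provenance are left out.

```python
def build_morph_config(mapping_path, section="DataSource1"):
    """Build the INI-style configuration string that Morph-KGC accepts."""
    return f"[{section}]\nmappings: {mapping_path}\n"


def materialize_mapping(mapping_path):
    """Run Morph-KGC on one mapping file and return the resulting rdflib graph."""
    # Deferred import, so this sketch loads even without the optional extra.
    import morph_kgc

    return morph_kgc.materialize(build_morph_config(mapping_path))
```

The triple count reported in the result could then be taken from the length of the returned graph.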

2. CLI commands

# Execute an RML mapping and load results into the local KG.
trails rml run mapping.ttl [--source-override key=path ...] [--graph <named-graph>]

# Validate an RML mapping without executing.
trails rml validate mapping.ttl

trails rml run writes the generated triples into the kernel store and emits a prov:Activity (see §6). --source-override lets the caller rebind logical source paths at invocation time without editing the mapping file — useful for CI pipelines that stage data in temp dirs.
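
Parsing the repeated key=path arguments is simple enough to sketch. The function name `parse_source_overrides` is hypothetical, and the choice to reject empty keys or paths is an assumption rather than specified behaviour.

```python
def parse_source_overrides(pairs):
    """Turn ["key=path", ...] into {"key": "path", ...}."""
    overrides = {}
    for pair in pairs:
        # Split on the first "=" only, so a path may itself contain "=".
        key, sep, path = pair.partition("=")
        if not sep or not key or not path:
            raise ValueError(f"expected key=path, got {pair!r}")
        overrides[key] = path
    return overrides
```

A CI job could then stage data in a temp dir and pass, for example, `--source-override sales_csv=/tmp/ci/sales.csv` without touching the mapping file.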

3. Integration with `trails.ingest`

IngestPipeline (currently ingest_file / ingest_directory) gains an rml_mapping= parameter:

from trails.ingest import ingest_file

# Code extractor (unchanged — progressive enhancement).
doc_iri = ingest_file("paper.pdf", ctx)

# RML-driven ingestion for structured data.
doc_iri = ingest_file(
    "transactions.csv",
    ctx,
    rml_mapping="mappings/transactions.ttl",
)

When rml_mapping= is set, the pipeline delegates to trails.rml instead of the code extractors. The chunker step is skipped — RML produces typed triples directly, not flat text that needs splitting. PROV-O provenance is emitted by the RML runner, not the ingest pipeline, to avoid double-counting.
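
The delegation described above amounts to a single branch. The sketch below injects the two paths as callables so it runs standalone; `rml_runner`, `text_pipeline`, and the `source=` keyword are hypothetical, and what the RML branch returns in place of a document IRI is not settled by this ADR.

```python
def ingest_file_sketch(path, ctx, rml_mapping=None, *, rml_runner, text_pipeline):
    """Route one file to the RML runner or to the code-extractor pipeline.

    rml_runner and text_pipeline are stand-ins for trails.rml.run_mapping and
    the existing extract, chunk, and load steps.
    """
    if rml_mapping is not None:
        # Structured path: no chunking here, and the runner emits the provenance.
        return rml_runner(rml_mapping, ctx, source=path)
    # Unstructured path: behaviour unchanged, provenance comes from the pipeline.
    return text_pipeline(path, ctx)
```

Keeping the branch at the top of the function means the code-extractor path never touches the optional dependency.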

4. `@rml_source` decorator

Register a reusable data source that RML mappings can reference by logical name:

from trails.rml import rml_source

@rml_source("sales_csv", kind="csv")
def sales_data():
    return "/data/exports/sales_2026.csv"

@rml_source("product_api", kind="json")
def product_feed():
    return "https://api.example.com/products"

The decorator registers the source in an in-process registry keyed by name. run_mapping resolves rml:logicalSource references against this registry so mappings can use stable logical names instead of hardcoded paths. Sources registered via @rml_source also appear in trails rml validate output (reachability check).
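
A minimal in-process registry along these lines could look as follows. The module-level `_SOURCES` dictionary, the `resolve_source` helper, and the duplicate-name check are assumptions about one possible implementation, not the confirmed design.

```python
_SOURCES = {}


def rml_source(name, kind):
    """Register a zero-argument function that returns the source location."""
    def decorator(func):
        if name in _SOURCES:
            raise ValueError(f"source {name!r} is already registered")
        _SOURCES[name] = (kind, func)
        return func
    return decorator


def resolve_source(name):
    """Return (kind, location); the location is computed at resolution time."""
    kind, func = _SOURCES[name]
    return kind, func()


@rml_source("sales_csv", kind="csv")
def sales_data():
    return "/data/exports/sales_2026.csv"
```

Because the function is called only when the source is resolved, the location can depend on the environment at run time, which is what makes dev and prod rebinding possible.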

5. Supported source types

Phase 1 (this ADR): CSV, JSON, and XML files. Morph-KGC reads all three natively, with the format selected by rml:referenceFormulation.

Phase 2 (follow-on work): RDBMS via SQLAlchemy (trails.rml.sql_source), REST APIs via trails.rml.http_source with built-in pagination and retry. Phase 2 is out of scope for this ADR but the module structure is designed to accommodate it without breaking changes.

Phase 3 (future ADR): Streaming sources (Kafka, MQTT). Requires a different execution model (continuous rather than batch); will be proposed separately when a real use case materialises.

6. Provenance

Every run_mapping execution emits PROV-O triples in the framework provenance graph (https://trails.dev/ns/prov/), consistent with ADR-0009:

<activity/rml/019…> a prov:Activity ;
    prov:startedAtTime "2026-04-17T14:30:00Z"^^xsd:dateTime ;
    prov:endedAtTime   "2026-04-17T14:30:02Z"^^xsd:dateTime ;
    prov:used          <mapping.ttl> ;        # the mapping file
    prov:used          <file:///data/sales.csv> ;  # the logical source(s)
    trails:activityKind "trails.rml.run_mapping" ;
    trails:tripleCount  4287 .

# Every generated triple's subject is linked as prov:generated.
<activity/rml/019…> prov:generated <iri1>, <iri2>, … .

The mapping file itself is asserted as a prov:Entity so downstream trace tooling (trails trace) can answer "which mapping produced this triple?" without scraping activity IRIs. When the mapping file lives in the KG (mappings are themselves RDF, as noted under Context), the provenance chain is fully internal.
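
The shape of the activity record can be sketched without any RDF library by building plain (subject, predicate, object) tuples. The PROV-O and rdf:type IRIs are the standard ones; the function name, the urn:example IRIs in the usage note, and the omission of literal datatypes and the trails: properties are simplifications for illustration.

```python
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def _stamp(moment):
    """Format an aware datetime as a UTC xsd:dateTime lexical value."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def activity_triples(activity_iri, mapping_iri, source_iris, started, ended, subjects):
    """Describe one mapping run as (subject, predicate, object) tuples."""
    triples = [
        (activity_iri, RDF_TYPE, PROV + "Activity"),
        (activity_iri, PROV + "startedAtTime", _stamp(started)),
        (activity_iri, PROV + "endedAtTime", _stamp(ended)),
        (activity_iri, PROV + "used", mapping_iri),
    ]
    triples += [(activity_iri, PROV + "used", src) for src in source_iris]
    # Many generated triples share a subject, so link each subject only once.
    triples += [(activity_iri, PROV + "generated", s) for s in sorted(set(subjects))]
    return triples
```

A caller would pass placeholder values such as `"urn:example:activity/1"` for the activity and the list of subjects taken from the generated graph; deduplicating subjects keeps the prov:generated fan-out proportional to the number of entities rather than the number of triples.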

7. Progressive enhancement

RML integration follows the north-star ADR-0021:

Stage	Surface	RML involvement
Simple ingest	`ingest_file("paper.pdf", ctx)`	None. Code extractors handle it.
Structured data	`ingest_file("data.csv", ctx, rml_mapping="m.ttl")`	User writes a Turtle mapping.
Schema inference	`trails rml generate --from data.csv`	Auto-generated RML from CSV headers / JSON schema. Ties into the auto-ontology work (future ADR).

No user is ever required to learn RML. Those who do need it get it without a mode switch: the same trails.ingest pipeline, the same provenance graph, the same trails trace tooling.

Non-goals

Reimplementing an RML processor. Morph-KGC exists, is actively maintained, and handles the spec. Trails wraps it; it does not compete with it.
SPARQL-based data sources in Phase 1. That is federation (ADR-0008 / MCP transport territory), not mapping. A future ADR may bridge RML's rml:query with the federation layer.
R2RML-only support. Trails targets full RML with logical sources, not the R2RML subset restricted to SQL. R2RML mappings are valid RML and will work, but the surface is RML-first.
GUI mapping editor. Out of scope. Third-party tooling (RMLEditor, or YARRRML with its parser) can produce the Turtle that trails rml run consumes.

Consequences

Positive

Structured data sources (CSV, JSON, XML) become first-class without writing Python extractors. A Turtle mapping file is all that's needed.
Mappings are versionable, diffable, auditable artifacts — better traceability than procedural code for data-pipeline compliance.
PROV-O provenance covers the full chain: which mapping, which source, when, how many triples. trails trace works unchanged.
The @rml_source registry decouples mapping logic from deployment paths — the same mapping runs in dev (local CSV) and prod (S3 export) by rebinding the source.
RML mappings stored in the KG are queryable: "show me all mappings that reference the sales source" is a SPARQL query.

Negative

New optional dependency (morph-kgc). Mitigation: lazy import, not in the default install. Users who never touch RML pay nothing.
RML has a learning curve for users unfamiliar with RDF mapping languages. Mitigation: the trails rml generate helper (Phase 2) bootstraps a mapping from a CSV/JSON schema; users edit rather than write from scratch.
Morph-KGC's output is an in-memory RDF graph (rdflib). Loading it into Oxigraph requires a serialisation round-trip (serialise to N-Triples, then parse). Mitigation: N-Triples is a line-based format that is cheap to write and parse in bulk; the overhead is negligible for datasets under 1M triples and acceptable up to ~10M.

Neutral

Storage model unchanged. RML-generated triples live in the same Oxigraph store as everything else. No new named-graph convention beyond the existing provenance graph.
The code-extractor path is unaffected. ingest_file("paper.pdf", ctx) continues to work exactly as today.

Relationship to other ADRs

ADR	Impact
ADR-0009 (Provenance always on)	Extended: RML activities use the same PROV-O pattern as ingest activities. New `activityKind` value `trails.rml.run_mapping`.
ADR-0017 (ActiveGraph ORM)	Unchanged. RML-generated triples are raw triples, not ORM nodes. Users who want ORM access define `@node_type` classes whose IRIs match the mapping output — additive, per ADR-0021.
ADR-0019 (App surface)	Extended: RML is a new ingestion pathway alongside code extractors.
ADR-0021 (Progressive enhancement)	Aligned. RML is an additive feature. No mode switch.
ADR-0022 (Cedar unified matcher)	Unchanged. Cedar policies match on the triples' types regardless of how they were produced.

Alternatives considered

Custom trails.ingest extractors for CSV / JSON / XML. Rejected. Each new source schema needs a new function, new tests, and manual triple construction, so the cost in code scales linearly with the number of schemas. With RML, a new schema costs one mapping file and no new Python.
SPARQL CONSTRUCT from a staging graph. Load raw data as untyped triples, then CONSTRUCT the target shape. Rejected: requires two-phase ingest, staging graph lifecycle management, and SPARQL fluency for what should be a mapping concern.
Use YARRRML (YAML-based RML syntax) as the primary surface. Rejected for Phase 1. YARRRML is friendlier but adds a transpilation step and another spec to track. The Turtle-native surface is closer to the KG substrate. YARRRML support can be added as a convenience layer in Phase 2 (parse YAML → emit Turtle → hand to run_mapping).
Direct rdflib integration without Morph-KGC. Rejected. Building an RML processor from scratch contradicts the non-goal. Morph-KGC is W3C-test-suite-compliant and actively maintained.

Open questions

Batch size for large mappings. Morph-KGC materialises the full output graph in memory. For very large sources (>10M rows) this may require chunked execution (partition the source, run N sub-mappings, merge). Should the chunking be automatic or caller-controlled? Recommendation: caller-controlled in Phase 1 (pass --chunk-size to trails rml run); automatic partitioning as a Phase 2 optimisation.
Named-graph placement. Should RML output go into the default graph or a mapping-specific named graph? The provenance activity already records the mapping origin, so a separate named graph is redundant for lineage. But named graphs simplify "delete everything this mapping produced" for re-runs. Recommendation: default graph, with an opt-in --graph flag on trails rml run.
Mapping-file storage. Should mappings be stored in the KG (as prov:Entity nodes with the Turtle serialised in a literal) or on the filesystem? Recommendation: filesystem-first (mappings live in the project repo, versioned by git). KG storage as an opt-in for apps that need runtime-editable mappings.
trails rml generate scope. The auto-generation helper is listed under Phase 2 / progressive enhancement. Should it infer OWL classes from CSV headers, or stop at rdf:type + literal properties? Recommendation: start with rdf:type + literals; OWL inference belongs in the auto-ontology ADR.
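
For the batch-size question, caller-controlled chunking of a CSV source could be as simple as the sketch below: split the rows into fixed-size partitions, repeat the header in each, and run the mapping once per partition. The helper names are hypothetical. Partitioning like this is only safe when each row maps independently; a mapping that joins across rows or sources would need a different strategy.

```python
import csv
import io


def iter_csv_chunks(source, chunk_size):
    """Yield CSV text chunks of at most chunk_size data rows, each with the header."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    reader = csv.reader(source)
    header = next(reader, None)
    if header is None:
        return  # empty source: nothing to yield
    rows = []
    for row in reader:
        rows.append(row)
        if len(rows) == chunk_size:
            yield _render(header, rows)
            rows = []
    if rows:
        yield _render(header, rows)


def _render(header, rows):
    """Serialise a header plus rows back to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Each chunk can be written to a temporary file and bound to the mapping through the source-override mechanism, so peak memory is bounded by the chunk size rather than the full source.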