RML Data Mapping¶
trails.rml turns structured data sources — CSV, JSON, XML — into typed
RDF triples using declarative RML mapping
files. Instead of writing a Python extractor for every source schema, you
write a Turtle mapping that describes how rows and fields map to subjects,
predicates, and objects. The full design lives in
ADR-0024; the progressive-enhancement framing that keeps RML additive is
covered in ADR-0021.
When to use RML vs code extractors¶
| Scenario | Recommended |
|---|---|
| Unstructured text (PDF, HTML, Markdown) | Code extractors via trails.ingest |
| Structured tabular data (CSV, TSV) | RML mapping |
| Semi-structured data (JSON, XML) | RML mapping |
| One-off import with custom logic | Code extractor |
| Many source schemas that change often | RML mappings (declarative, diffable) |
| RDBMS tables | RML mapping with @rml_source(kind="rdbms") |
RML mappings are data, not code. They can be versioned, diffed, reviewed, and shared without touching Python. They are also RDF themselves, so they can live in the KG and be queried with SPARQL.
Installation¶
RML support requires Morph-KGC, an actively maintained W3C-test-suite-compliant RML processor:
The import is lazy. If morph-kgc is not installed, trails.rml raises
TrailsError with the exact install command — no silent fallback, no raw
ImportError.
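The lazy-import behaviour can be pictured with a small sketch. This is not the Trails source: `require_optional`, `load_rml_engine`, and `MissingDependencyError` (a stand-in for `TrailsError`) are hypothetical names. The install hint is the plain PyPI command for Morph-KGC, and the command Trails actually prints may differ.

```python
import importlib
from types import ModuleType


class MissingDependencyError(RuntimeError):
    """Stand-in for TrailsError; hypothetical, for illustration only."""


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency on first use, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise as a domain error that names the install command,
        # instead of leaking a bare ImportError to the caller.
        raise MissingDependencyError(
            f"{module_name!r} is required for this feature. "
            f"Install it with: {install_hint}"
        ) from exc


def load_rml_engine() -> ModuleType:
    # Deferred until an RML function is actually called, so importing the
    # surrounding package never fails when the optional engine is absent.
    return require_optional("morph_kgc", "pip install morph-kgc")
```

Calling `load_rml_engine()` only at the point of use keeps the rest of the package importable on machines without the RML extra.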
Quickstart¶
One CSV file, one mapping, loaded into the KG with full provenance:
from trails.testing import fresh_context
from trails.rml import run_mapping
ctx = fresh_context()
result = run_mapping(ctx, "mappings/sales.ttl", sales="/data/sales.csv")
print(f"Loaded {result.triples_added} triples in {result.duration_ms:.0f}ms")
print(f"Activity: {result.activity_iri}")
Writing your first RML mapping¶
Suppose you have a CSV file employees.csv:
Write a Turtle mapping file mappings/employees.ttl:
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <https://myapp.example/> .
<#EmployeeMapping> a rr:TriplesMap ;
rml:logicalSource [
rml:source "employees.csv" ;
rml:referenceFormulation ql:CSV
] ;
rr:subjectMap [
rr:template "https://myapp.example/Employee/{id}" ;
rr:class ex:Employee
] ;
rr:predicateObjectMap [
rr:predicate ex:name ;
rr:objectMap [ rml:reference "name" ]
] ;
rr:predicateObjectMap [
rr:predicate ex:department ;
rr:objectMap [ rml:reference "department" ]
] ;
rr:predicateObjectMap [
rr:predicate ex:salary ;
rr:objectMap [
rml:reference "salary" ;
rr:datatype xsd:integer
]
] .
Execute it:
from trails.rml import run_mapping
result = run_mapping(ctx, "mappings/employees.ttl")
# Loads 3 employees x 4 triples each (type + 3 properties) = 12 triples
The generated triples look like:
<https://myapp.example/Employee/1> a <https://myapp.example/Employee> ;
<https://myapp.example/name> "Alice" ;
<https://myapp.example/department> "Engineering" ;
<https://myapp.example/salary> "95000"^^xsd:integer .
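To make the mapping semantics concrete, the sketch below applies the same rules by hand with the standard library only. It is an illustration of what the mapping describes, not how Morph-KGC or Trails executes it. The sample rows are assumptions: only the Alice row is taken from the output above, and the other two are invented so that the file has three employees.

```python
import csv
import io

# Hypothetical sample data; only the first row (Alice) comes from the docs.
EMPLOYEES_CSV = """id,name,department,salary
1,Alice,Engineering,95000
2,Bob,Marketing,72000
3,Carol,Finance,88000
"""

BASE = "https://myapp.example/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def map_rows(text: str) -> list[tuple[str, str, str]]:
    """Apply the employees mapping by hand; return N-Triples-style tuples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        name, dept, salary = row["name"], row["department"], row["salary"]
        # rr:template: substitute {id} with the value of the id column.
        subject = f"<{BASE}Employee/{row['id']}>"
        # rr:class contributes one rdf:type triple per subject.
        triples.append((subject, f"<{RDF_TYPE}>", f"<{BASE}Employee>"))
        # rml:reference without rr:datatype yields a plain string literal.
        triples.append((subject, f"<{BASE}name>", f'"{name}"'))
        triples.append((subject, f"<{BASE}department>", f'"{dept}"'))
        # rr:datatype attaches the datatype IRI to the literal.
        triples.append((subject, f"<{BASE}salary>", f'"{salary}"^^<{XSD_INTEGER}>'))
    return triples


triples = map_rows(EMPLOYEES_CSV)
print(len(triples))
for s, p, o in triples[:4]:
    print(s, p, o, ".")
```

With three rows the function returns twelve tuples, which matches the count in the comment above: one type triple plus three property triples per employee.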
CLI commands¶
trails rml run¶
Execute an RML mapping and load results into the local KG:
# Basic execution
trails rml run mappings/employees.ttl
# Override source paths at invocation time
trails rml run mappings/sales.ttl --source-override sales=/tmp/staged/sales.csv
# Write to a specific named graph
trails rml run mappings/employees.ttl --graph https://myapp.example/imports/2026-04
--source-override rebinds logical source paths without editing the
mapping file — useful for CI pipelines that stage data in temp directories.
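The `name=path` pairs given to `--source-override` carry the same binding as the keyword arguments of `run_mapping` (for example `sales="/data/sales.csv"` in the Quickstart). A minimal sketch of that translation follows; `parse_source_overrides` is a hypothetical helper, not part of the Trails CLI.

```python
def parse_source_overrides(pairs: list[str]) -> dict[str, str]:
    """Turn ["sales=/tmp/staged/sales.csv"] into {"sales": "/tmp/staged/sales.csv"}."""
    overrides = {}
    for pair in pairs:
        # Split on the first "=" only, so paths or URLs containing "=" survive intact.
        name, sep, path = pair.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected NAME=PATH, got {pair!r}")
        overrides[name] = path
    return overrides


overrides = parse_source_overrides(["sales=/tmp/staged/sales.csv"])
# The dict has the shape of the keyword arguments used in the Quickstart:
# run_mapping(ctx, "mappings/sales.ttl", **overrides)
print(overrides)
```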
trails rml validate¶
Check a mapping for structural errors without executing it:
The validator checks:
| Check | Severity | Description |
|---|---|---|
| Turtle syntax | Error | The file must parse as valid Turtle |
| Subject maps | Error | Every rr:TriplesMap must have rr:subjectMap or rr:subject |
| Predicate-object maps | Error | Every rr:TriplesMap must have at least one rr:predicateObjectMap |
| Logical source reachability | Warning | Local file paths referenced by rml:source are checked for existence |
An empty issue list means the mapping passed all checks:
from trails.rml import validate_mapping

issues = validate_mapping("mappings/employees.ttl")
if issues:
    for issue in issues:
        print(f"[{issue.severity.upper()}] {issue.message}")
else:
    print("Mapping is valid")
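Because the validator separates errors from warnings, a CI job may want to fail on errors only and tolerate warnings such as a source file that is staged later. The sketch below shows one way to turn an issue list into an exit code. `Issue` is a stand-in with the two attributes documented for `ValidationIssue`; `exit_code`, the `strict` flag, and the sample message are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Stand-in with the two attributes documented for ValidationIssue."""
    severity: str  # "error" or "warning"
    message: str


def exit_code(issues: list[Issue], strict: bool = False) -> int:
    """Return 1 if any error is present (or any issue at all when strict), else 0."""
    blocking = [i for i in issues if strict or i.severity == "error"]
    return 1 if blocking else 0


sample = [Issue("warning", "source file employees.csv not found")]
print(exit_code(sample))               # warnings alone do not block
print(exit_code(sample, strict=True))  # strict mode treats warnings as failures
```

In a real pipeline the list would come from `validate_mapping(...)` and the result would be passed to `sys.exit`.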
trails rml sources¶
List all registered RML sources (see @rml_source below):
Source registry with @rml_source¶
The @rml_source decorator registers a reusable data source that RML
mappings can reference by logical name. This decouples mapping logic from
deployment paths — the same mapping runs in dev (local CSV) and prod
(S3 export or API endpoint) by rebinding the source.
from trails.rml import rml_source

@rml_source("sales_csv", kind="csv")
def sales_data():
    return "/data/exports/sales_2026.csv"

@rml_source("product_feed", kind="json")
def product_api():
    return "https://api.example.com/products.json"
Supported kinds: "csv", "json", "xml" (Phase 1), "rdbms",
"http" (Phase 2).
Inspect the registry programmatically:
from trails.rml import get_source, list_sources, clear_sources
# Look up a single source
entry = get_source("sales_csv")
print(entry.resolve()) # "/data/exports/sales_2026.csv"
# List all registered sources
for src in list_sources():
    print(f"{src.name} ({src.kind}): {src.resolve()}")
# Clear the registry (test teardown)
clear_sources()
The runner resolves rml:logicalSource references against this registry
so mappings can use stable names. Sources registered via @rml_source
also appear in trails rml validate output (reachability check).
Integration with IngestPipeline¶
See the IngestPipeline integration (Phase 2)
section below for how RML plugs into ingest_file() via the
rml_mapping= parameter.
PROV-O provenance¶
Every run_mapping execution emits PROV-O triples in the framework
provenance graph (https://trails.dev/ns/prov/), consistent with
ADR-0009:
<https://trails.dev/activity/rml/019…> a prov:Activity ;
prov:startedAtTime "2026-04-17T14:30:00Z"^^xsd:dateTime ;
prov:endedAtTime "2026-04-17T14:30:02Z"^^xsd:dateTime ;
prov:used <file:///project/mappings/sales.ttl> ;
prov:used <file:///data/sales.csv> ;
trails:activityKind "trails.rml.run_mapping" ;
trails:tripleCount 4287 .
The mapping file and each source file are asserted as prov:Entity
nodes so downstream trace tooling (trails trace) can answer "which
mapping produced this triple?" without scraping activity IRIs.
The MappingResult object returned by run_mapping carries the
activity_iri for programmatic access:
result = run_mapping(ctx, "mappings/sales.ttl")
print(result.activity_iri)
# "https://trails.dev/activity/rml/019…"
RDBMS sources (Phase 2)¶
Register a database as an RML source with kind="rdbms". The resolver
returns a SQLAlchemy connection string:
from trails.rml import rml_source
@rml_source("warehouse", kind="rdbms")
def warehouse_db():
    return "postgresql://user:pass@host/db"
Write a mapping that references the database tables. The RML processor
issues SQL queries to extract rows and maps them to triples using the
same rr:subjectMap / rr:predicateObjectMap syntax as file-based
mappings:
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix ex: <https://myapp.example/> .
<#OrderMapping> a rr:TriplesMap ;
rml:logicalSource [
rml:source "warehouse" ;
rr:sqlVersion rr:SQL2008 ;
rr:tableName "orders"
] ;
rr:subjectMap [
rr:template "https://myapp.example/Order/{order_id}" ;
rr:class ex:Order
] ;
rr:predicateObjectMap [
rr:predicate ex:customer ;
rr:objectMap [ rml:reference "customer_name" ]
] ;
rr:predicateObjectMap [
rr:predicate ex:total ;
rr:objectMap [
rml:reference "total_amount" ;
rr:datatype <http://www.w3.org/2001/XMLSchema#decimal>
]
] .
HTTP sources (Phase 2)¶
Register a URL as an RML source with kind="http". The runner fetches
the content before executing the mapping:
@rml_source("product_api", kind="http")
def product_feed():
    return "https://api.example.com/products.json"
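The fetch-before-map step can be sketched with the standard library. This is an illustration of the idea, not the Trails runner: `fetch_to_tempfile` is a hypothetical helper, and the demo uses a `file://` URL with invented JSON content so that it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_to_tempfile(url: str, suffix: str = "", timeout: float = 30.0) -> Path:
    """Download url into a new temporary file and return its path."""
    # delete=False keeps the file on disk after the handle closes, so a
    # mapping engine can open it by path; the caller must clean it up.
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(response, tmp)
        return Path(tmp.name)


# Demo with a local file:// URL instead of a live API.
src = Path(tempfile.mkdtemp()) / "products.json"
src.write_text('[{"sku": "A1"}]')
local = fetch_to_tempfile(src.as_uri(), suffix=".json")
print(local.read_text())
```

Keeping the suffix on the temporary file matters when the downstream tool infers the format from the file extension.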
The source URL is fetched to a temporary file, then the mapping runs
against it. This works with JSON, XML, or CSV responses. Use
--source-override in CI to point at a cached local copy instead of
hitting the API every run:
IngestPipeline integration (Phase 2)¶
RML integrates with the existing trails.ingest pipeline. When
rml_mapping= is set, the pipeline delegates to trails.rml instead of
code extractors:
from trails.ingest import ingest_file
# Code extractor (unchanged — progressive enhancement)
doc_iri = ingest_file("paper.pdf", ctx)
# RML-driven ingestion for structured data
doc_iri = ingest_file(
    "transactions.csv",
    ctx,
    rml_mapping="mappings/transactions.ttl",
)
When using RML through the ingest pipeline:
- The chunker step is skipped: RML produces typed triples directly, not flat text that needs splitting.
- PROV-O provenance is emitted by the RML runner, not the ingest pipeline, to avoid double-counting. The ingest prov:Activity links to the RML activity via prov:wasInformedBy.
- Content-hash dedup still applies at the file level.
Auto-generating mappings with trails rml generate (Phase 3)¶
Instead of hand-writing RML Turtle for every source, use
generate_mapping to auto-generate a mapping from a source file's
schema. The generator reads headers/keys/elements from CSV, JSON, or XML
files, infers XSD types from sampled values, and produces a valid RML
mapping ready for trails rml run.
CLI usage¶
# Generate from a CSV file (prints Turtle to stdout)
trails rml generate data/employees.csv
# Specify source type explicitly (auto-detected from extension by default)
trails rml generate data/products.json --source-type json
# Custom base IRI and node type name
trails rml generate data/orders.csv \
--base-iri https://myapp.example/ \
--node-type-name Order
# Write output to a file
trails rml generate data/employees.csv -o mappings/employees.ttl
Python API¶
from trails.rml.generate import generate_mapping
result = generate_mapping(
    "data/employees.csv",
    source_type="csv",
    base_iri="https://myapp.example/",
    node_type_name="employee",
)
print(f"Detected columns: {result.columns}")
print(f"Inferred types: {result.inferred_types}")
print(f"Node type: {result.node_type_name}")
print()
print(result.turtle)
Type inference¶
The generator samples the first 100 rows and infers XSD types using regex heuristics:
| Pattern | Inferred type |
|---|---|
| All integers (42, -7) | xsd:integer |
| All floats (3.14, 1.0e5) | xsd:double |
| All dates (2026-04-17) | xsd:date |
| All datetimes (2026-04-17T14:30:00Z) | xsd:dateTime |
| All booleans (true, false, yes, no) | xsd:boolean |
| All HTTP(S) URLs | xsd:anyURI (rendered as rr:IRI term type) |
| Everything else | xsd:string |
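A regex-based inference pass along the lines of the table could look like the sketch below. The patterns, their order, and the choice to ignore empty cells are assumptions made for illustration; the generator's actual expressions may differ.

```python
import re

# Checked in order; the first pattern that matches every sampled value wins.
_PATTERNS = [
    ("xsd:integer", re.compile(r"[+-]?\d+")),
    ("xsd:double", re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")),
    ("xsd:dateTime", re.compile(
        r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?")),
    ("xsd:date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("xsd:boolean", re.compile(r"true|false|yes|no", re.IGNORECASE)),
    ("xsd:anyURI", re.compile(r"https?://\S+", re.IGNORECASE)),
]


def infer_xsd_type(values: list[str], sample_size: int = 100) -> str:
    """Infer one XSD type for a column from its first sample_size values."""
    # Empty cells are skipped so that a few missing values do not force xsd:string.
    sample = [v.strip() for v in values[:sample_size] if v.strip()]
    if not sample:
        return "xsd:string"
    for xsd_type, pattern in _PATTERNS:
        # fullmatch requires the whole value to fit, not just a prefix.
        if all(pattern.fullmatch(v) for v in sample):
            return xsd_type
    return "xsd:string"


print(infer_xsd_type(["42", "-7"]))
print(infer_xsd_type(["3.14", "1.0e5"]))
print(infer_xsd_type(["2026-04-17"]))
print(infer_xsd_type(["Alice", "42"]))
```

Order matters here: integers are tested before doubles so that a column of whole numbers is not widened, and a mixed column such as `["1", "2.5"]` falls through to `xsd:double`.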
Example output¶
Given employees.csv with columns id, name, department, salary, the
generated mapping looks like:
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<#TriplesMap_Employee> a rr:TriplesMap ;
rml:logicalSource [
rml:source "employees.csv" ;
rml:referenceFormulation ql:CSV
] ;
rr:subjectMap [
rr:template "https://myapp.example/Employee/{id}" ;
rr:class <https://myapp.example/Employee>
] ;
rr:predicateObjectMap [
rr:predicate <https://myapp.example/id> ;
rr:objectMap [
rml:reference "id" ;
rr:datatype xsd:integer
]
] ;
rr:predicateObjectMap [
rr:predicate <https://myapp.example/name> ;
rr:objectMap [ rml:reference "name" ]
] ;
...
The output is a starting point. Review the generated mapping, adjust
types and predicates, then run it with trails rml run.
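For readers curious how such output can be assembled, here is a minimal sketch that renders a CSV mapping from a column-to-type dictionary using string templates. `render_mapping` and its parameters are hypothetical and do not reflect the internals of `generate_mapping`. The sketch assumes the caller picks the key column, does no escaping of column names, and omits the `rr:IRI` handling for URL columns.

```python
def render_mapping(source: str, base_iri: str, node_type: str, key: str,
                   columns: dict[str, str]) -> str:
    """Render an RML mapping for a CSV source; columns maps name -> XSD type."""
    header = [
        "@prefix rml: <http://semweb.mmlab.be/ns/rml#> .",
        "@prefix rr: <http://www.w3.org/ns/r2rml#> .",
        "@prefix ql: <http://semweb.mmlab.be/ns/ql#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
    ]
    parts = [
        f"<#TriplesMap_{node_type}> a rr:TriplesMap",
        f'  rml:logicalSource [ rml:source "{source}" ; '
        "rml:referenceFormulation ql:CSV ]",
        # Triple braces emit a literal {key} placeholder for the RML template.
        f'  rr:subjectMap [ rr:template "{base_iri}{node_type}/{{{key}}}" ; '
        f"rr:class <{base_iri}{node_type}> ]",
    ]
    for column, xsd_type in columns.items():
        # Plain string literals need no rr:datatype, as in the example above.
        datatype = "" if xsd_type == "xsd:string" else f" ; rr:datatype {xsd_type}"
        parts.append(
            f"  rr:predicateObjectMap [ rr:predicate <{base_iri}{column}> ; "
            f'rr:objectMap [ rml:reference "{column}"{datatype} ] ]'
        )
    # Property groups are separated by ";" and the triples map ends with ".".
    return "\n".join(header) + "\n" + " ;\n".join(parts) + " .\n"


turtle = render_mapping(
    "employees.csv", "https://myapp.example/", "Employee", "id",
    {"id": "xsd:integer", "name": "xsd:string",
     "department": "xsd:string", "salary": "xsd:integer"},
)
print(turtle)
```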
Anti-patterns¶
Hardcoding paths in mapping files. Use @rml_source or
--source-override to decouple paths from mappings. Hardcoded paths
break when moving between dev, CI, and production.
Skipping validation before execution. Run trails rml validate
in CI to catch syntax errors and missing sources before they fail at
runtime. Validation is fast and free.
Using RML for unstructured text. RML is designed for structured and
semi-structured data with predictable schemas. For PDFs, HTML, and
plain text, use the code extractors in trails.ingest.
Reference¶
| Symbol | Description |
|---|---|
| run_mapping(ctx, mapping_path, **source_overrides) | Execute an RML mapping; returns MappingResult |
| MappingResult | .triples_added, .sources, .duration_ms, .activity_iri |
| validate_mapping(mapping_path) | Validate without executing; returns list[ValidationIssue] |
| ValidationIssue | .severity ("error" / "warning"), .message |
| rml_source(name, *, kind) | Decorator: register a function as an RML logical source. Kinds: "csv", "json", "xml", "rdbms", "http" |
| SourceEntry | .name, .kind, .resolver, .resolve() |
| get_source(name) | Look up a registered source by name |
| list_sources() | Return all registered sources, sorted by name |
| clear_sources() | Remove all registered sources (test teardown) |
| generate_mapping(source_path, *, source_type, base_iri, node_type_name) | Phase 3: auto-generate an RML mapping from a source file's schema |
| GeneratedMapping | .turtle, .source_type, .columns, .inferred_types, .node_type_name |
See also¶
- ADR-0024: full RML integration design
- Document Ingestion: code extractors for unstructured text
- ActiveGraph ORM: @node_type for the triples RML produces
- Observability: cost tracking and PROV-O events
- RML specification: the W3C Community Group spec
- Morph-KGC: the RML processor Trails wraps