RML Data Mapping

trails.rml turns structured data sources (CSV, JSON, XML) into typed RDF triples using declarative RML mapping files. Instead of writing a Python extractor for every source schema, you write a Turtle mapping that describes how rows and fields map to subjects, predicates, and objects. The full design lives in ADR-0024; the progressive-enhancement framing that keeps RML additive is ADR-0021.

When to use RML vs code extractors

| Scenario | Recommended |
|---|---|
| Unstructured text (PDF, HTML, Markdown) | Code extractors via trails.ingest |
| Structured tabular data (CSV, TSV) | RML mapping |
| Semi-structured data (JSON, XML) | RML mapping |
| One-off import with custom logic | Code extractor |
| Many source schemas that change often | RML mappings (declarative, diffable) |
| RDBMS tables | RML mapping with @rml_source(kind="rdbms") |

RML mappings are data, not code. They can be versioned, diffed, reviewed, and shared without touching Python. They are also RDF themselves, so they can live in the KG and be queried with SPARQL.

Installation

RML support requires Morph-KGC, an actively maintained W3C-test-suite-compliant RML processor:

pip install 'trails[rml]'

The import is lazy. If morph-kgc is not installed, trails.rml raises TrailsError with the exact install command — no silent fallback, no raw ImportError.
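
The lazy-import behaviour described above can be approximated with a small guard function. This is a minimal sketch, not the code trails.rml actually ships: the TrailsError class below is a stand-in definition and require_module is a hypothetical helper name. It shows the pattern of deferring the import until first use and converting a raw ImportError into an error that names the install command.

```python
import importlib


class TrailsError(Exception):
    """Stand-in for the framework's error type (hypothetical definition)."""


def require_module(module_name: str, install_hint: str):
    """Import module_name on first use, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise as a domain error that names the exact install command,
        # keeping the original ImportError as the cause for debugging.
        raise TrailsError(
            f"{module_name} is required for RML support. "
            f"Install it with: {install_hint}"
        ) from exc


if __name__ == "__main__":
    try:
        require_module("morph_kgc", "pip install 'trails[rml]'")
        print("morph_kgc is available")
    except TrailsError as err:
        print(err)
```

Because the import happens inside the function, importing the surrounding package stays cheap for users who never touch RML.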

Quickstart

One CSV file, one mapping, loaded into the KG with full provenance:

from trails.testing import fresh_context
from trails.rml import run_mapping

ctx = fresh_context()
result = run_mapping(ctx, "mappings/sales.ttl", sales="/data/sales.csv")
print(f"Loaded {result.triples_added} triples in {result.duration_ms:.0f}ms")
print(f"Activity: {result.activity_iri}")

Writing your first RML mapping

Suppose you have a CSV file employees.csv:

id,name,department,salary
1,Alice,Engineering,95000
2,Bob,Marketing,82000
3,Carol,Engineering,98000

Write a Turtle mapping file mappings/employees.ttl:

@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://myapp.example/> .

<#EmployeeMapping> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "employees.csv" ;
        rml:referenceFormulation ql:CSV
    ] ;
    rr:subjectMap [
        rr:template "https://myapp.example/Employee/{id}" ;
        rr:class ex:Employee
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rml:reference "name" ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:department ;
        rr:objectMap [ rml:reference "department" ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:salary ;
        rr:objectMap [
            rml:reference "salary" ;
            rr:datatype xsd:integer
        ]
    ] .

Execute it:

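from trails.testing import fresh_context
from trails.rml import run_mapping

ctx = fresh_context()  # same context setup as in the Quickstart
result = run_mapping(ctx, "mappings/employees.ttl")
# Loads 3 employees x 4 triples each (type + 3 properties) = 12 triples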

The generated triples look like:

<https://myapp.example/Employee/1> a <https://myapp.example/Employee> ;
    <https://myapp.example/name> "Alice" ;
    <https://myapp.example/department> "Engineering" ;
    <https://myapp.example/salary> "95000"^^xsd:integer .
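
To make the row-to-triple expansion concrete, here is a plain-Python sketch of what the mapping above asks the RML processor to do. It is an illustration only and does not use Morph-KGC or trails.rml; expand_rows is a hypothetical helper, and objects are kept as plain strings without the xsd:integer datatype.

```python
import csv
import io
from urllib.parse import quote

CSV_TEXT = """id,name,department,salary
1,Alice,Engineering,95000
2,Bob,Marketing,82000
3,Carol,Engineering,98000
"""

BASE = "https://myapp.example/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand_rows(csv_text: str) -> list[tuple[str, str, str]]:
    """Return (subject, predicate, object) tuples for every CSV row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # rr:template: substitute the percent-encoded column value for {id}
        subject = f"{BASE}Employee/{quote(row['id'], safe='')}"
        # rr:class adds one rdf:type triple per subject
        triples.append((subject, RDF_TYPE, f"{BASE}Employee"))
        # one triple per rr:predicateObjectMap
        for column in ("name", "department", "salary"):
            triples.append((subject, f"{BASE}{column}", row[column]))
    return triples


triples = expand_rows(CSV_TEXT)
print(len(triples))  # 3 rows x 4 triples each
```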

CLI commands

trails rml run

Execute an RML mapping and load results into the local KG:

# Basic execution
trails rml run mappings/employees.ttl

# Override source paths at invocation time
trails rml run mappings/sales.ttl --source-override sales=/tmp/staged/sales.csv

# Write to a specific named graph
trails rml run mappings/employees.ttl --graph https://myapp.example/imports/2026-04

--source-override rebinds logical source paths without editing the mapping file — useful for CI pipelines that stage data in temp directories.
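
The name=path pairs map naturally onto the keyword arguments of run_mapping(ctx, mapping_path, **source_overrides). The sketch below shows one way to turn the CLI values into that dictionary; parse_source_overrides is a hypothetical helper, not part of the trails CLI.

```python
def parse_source_overrides(pairs: list[str]) -> dict[str, str]:
    """Turn ["name=path", ...] into {"name": "path", ...}."""
    overrides: dict[str, str] = {}
    for pair in pairs:
        # Split on the first "=" only, so paths containing "=" survive intact.
        name, sep, path = pair.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected NAME=PATH, got {pair!r}")
        overrides[name] = path
    return overrides


overrides = parse_source_overrides(["sales=/tmp/staged/sales.csv"])
print(overrides)
# The result could then be passed on as keyword arguments, for example:
# run_mapping(ctx, "mappings/sales.ttl", **overrides)
```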

trails rml validate

Check a mapping for structural errors without executing it:

trails rml validate mappings/employees.ttl

The validator checks:

| Check | Severity | Description |
|---|---|---|
| Turtle syntax | Error | The file must parse as valid Turtle |
| Subject maps | Error | Every rr:TriplesMap must have rr:subjectMap or rr:subject |
| Predicate-object maps | Error | Every rr:TriplesMap must have at least one rr:predicateObjectMap |
| Logical source reachability | Warning | Local file paths referenced by rml:source are checked for existence |

An empty issue list means the mapping passed all checks:

from trails.rml import validate_mapping

issues = validate_mapping("mappings/employees.ttl")
if issues:
    for issue in issues:
        print(f"[{issue.severity.upper()}] {issue.message}")
else:
    print("Mapping is valid")
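
In CI you usually want errors to fail the build while warnings only get reported. The sketch below shows that policy on a stand-in Issue dataclass that mirrors the documented .severity and .message fields; ci_exit_code and its fail_on_warning flag are hypothetical names, not part of trails.rml.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Stand-in mirroring ValidationIssue's documented fields."""
    severity: str  # "error" or "warning"
    message: str


def ci_exit_code(issues: list[Issue], fail_on_warning: bool = False) -> int:
    """Return 1 if any issue should block the build, otherwise 0."""
    blocking = {"error", "warning"} if fail_on_warning else {"error"}
    return 1 if any(issue.severity in blocking for issue in issues) else 0


sample = [Issue("warning", "source file employees.csv not found")]
print(ci_exit_code(sample))                        # warnings alone pass
print(ci_exit_code(sample, fail_on_warning=True))  # strict mode fails
```

A CI script could pass the return value to sys.exit after printing each issue.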

trails rml sources

List all registered RML sources (see @rml_source below):

trails rml sources

Source registry with @rml_source

The @rml_source decorator registers a reusable data source that RML mappings can reference by logical name. This decouples mapping logic from deployment paths — the same mapping runs in dev (local CSV) and prod (S3 export or API endpoint) by rebinding the source.

from trails.rml import rml_source

@rml_source("sales_csv", kind="csv")
def sales_data():
    return "/data/exports/sales_2026.csv"

@rml_source("product_feed", kind="json")
def product_api():
    return "https://api.example.com/products.json"

Supported kinds: "csv", "json", and "xml" (Phase 1); "rdbms" and "http" (Phase 2).

Inspect the registry programmatically:

from trails.rml import get_source, list_sources, clear_sources

# Look up a single source
entry = get_source("sales_csv")
print(entry.resolve())  # "/data/exports/sales_2026.csv"

# List all registered sources
for src in list_sources():
    print(f"{src.name} ({src.kind}): {src.resolve()}")

# Clear the registry (test teardown)
clear_sources()

The runner resolves rml:logicalSource references against this registry so mappings can use stable names. Sources registered via @rml_source also appear in trails rml validate output (reachability check).
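
For intuition, a decorator-based registry of this shape can be built in a few lines. The sketch below is an assumption about the general pattern, not the trails.rml implementation; Entry, register_source, and list_entries are hypothetical stand-ins for SourceEntry, rml_source, and list_sources.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Entry:
    """Stand-in for a registry entry: a name, a kind, and a resolver."""
    name: str
    kind: str
    resolver: Callable[[], str]

    def resolve(self) -> str:
        # The resolver runs only when asked, so the returned path can
        # depend on the environment at run time, not at import time.
        return self.resolver()


_REGISTRY: dict[str, Entry] = {}


def register_source(name: str, *, kind: str):
    """Decorator that records the wrapped function under a logical name."""
    def decorator(func: Callable[[], str]) -> Callable[[], str]:
        _REGISTRY[name] = Entry(name, kind, func)
        return func  # the function itself is left unchanged
    return decorator


def list_entries() -> list[Entry]:
    """Return all entries sorted by logical name."""
    return [_REGISTRY[name] for name in sorted(_REGISTRY)]


@register_source("sales_csv", kind="csv")
def sales_data() -> str:
    return "/data/exports/sales_2026.csv"


for entry in list_entries():
    print(f"{entry.name} ({entry.kind}): {entry.resolve()}")
```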

Integration with IngestPipeline

See the IngestPipeline integration (Phase 2) section below for how RML plugs into ingest_file() via the rml_mapping= parameter.

PROV-O provenance

Every run_mapping execution emits PROV-O triples in the framework provenance graph (https://trails.dev/ns/prov/), consistent with ADR-0009:

<https://trails.dev/activity/rml/019…> a prov:Activity ;
    prov:startedAtTime "2026-04-17T14:30:00Z"^^xsd:dateTime ;
    prov:endedAtTime   "2026-04-17T14:30:02Z"^^xsd:dateTime ;
    prov:used          <file:///project/mappings/sales.ttl> ;
    prov:used          <file:///data/sales.csv> ;
    trails:activityKind "trails.rml.run_mapping" ;
    trails:tripleCount  4287 .

The mapping file and each source file are asserted as prov:Entity nodes so downstream trace tooling (trails trace) can answer "which mapping produced this triple?" without scraping activity IRIs.

The MappingResult object returned by run_mapping carries the activity_iri for programmatic access:

result = run_mapping(ctx, "mappings/sales.ttl")
print(result.activity_iri)
# "https://trails.dev/activity/rml/019…"

RDBMS sources (Phase 2)

Register a database as an RML source with kind="rdbms". The resolver returns a SQLAlchemy connection string:

from trails.rml import rml_source

@rml_source("warehouse", kind="rdbms")
def warehouse_db():
    return "postgresql://user:pass@host/db"
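
Because the resolver is an ordinary function, it can assemble the connection string at call time so that credentials stay out of source control. The sketch below is one possible approach; warehouse_dsn and the WAREHOUSE_* environment variable names are hypothetical, not something trails.rml defines.

```python
import os
from urllib.parse import quote_plus


def warehouse_dsn(env=os.environ) -> str:
    """Build a PostgreSQL connection URL from environment variables."""
    user = env["WAREHOUSE_USER"]
    # Percent-encode the password so characters like "@" or "/" do not
    # break the URL structure.
    password = quote_plus(env["WAREHOUSE_PASSWORD"])
    host = env.get("WAREHOUSE_HOST", "localhost")
    database = env.get("WAREHOUSE_DB", "db")
    return f"postgresql://{user}:{password}@{host}/{database}"


example_env = {
    "WAREHOUSE_USER": "user",
    "WAREHOUSE_PASSWORD": "p@ss",
    "WAREHOUSE_HOST": "host",
    "WAREHOUSE_DB": "db",
}
print(warehouse_dsn(example_env))
```

A function like this could serve as the body of a resolver decorated with @rml_source("warehouse", kind="rdbms").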

Write a mapping that references the database tables. The RML processor issues SQL queries to extract rows and maps them to triples using the same rr:subjectMap / rr:predicateObjectMap syntax as file-based mappings:

@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <https://myapp.example/> .

<#OrderMapping> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "warehouse" ;
        rr:sqlVersion rr:SQL2008 ;
        rr:tableName "orders"
    ] ;
    rr:subjectMap [
        rr:template "https://myapp.example/Order/{order_id}" ;
        rr:class ex:Order
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:customer ;
        rr:objectMap [ rml:reference "customer_name" ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:total ;
        rr:objectMap [
            rml:reference "total_amount" ;
            rr:datatype <http://www.w3.org/2001/XMLSchema#decimal>
        ]
    ] .

HTTP sources (Phase 2)

Register a URL as an RML source with kind="http". The runner fetches the content before executing the mapping:

@rml_source("product_api", kind="http")
def product_feed():
    return "https://api.example.com/products.json"

The source URL is fetched to a temporary file, then the mapping runs against it. This works with JSON, XML, or CSV responses. Use --source-override in CI to point at a cached local copy instead of hitting the API every run:

trails rml run mappings/products.ttl \
    --source-override product_api=/tmp/cached/products.json
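
The fetch-then-map step can be pictured with the standard library alone. This is a sketch of the idea, assuming a plain GET with no authentication or retries; fetch_to_tempfile is a hypothetical helper and not the runner's actual code.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_to_tempfile(url: str, suffix: str = "") -> Path:
    """Download url into a new temporary file and return its path."""
    with urllib.request.urlopen(url, timeout=30) as response:
        # delete=False keeps the file on disk after the handle closes,
        # so a later mapping step can read it; the caller cleans it up.
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            shutil.copyfileobj(response, tmp)
            return Path(tmp.name)


if __name__ == "__main__":
    # file:// URLs work too, which is handy for trying this offline.
    sample = Path(tempfile.mkdtemp()) / "products.json"
    sample.write_text('[{"sku": "A1"}]', encoding="utf-8")
    local_copy = fetch_to_tempfile(sample.as_uri(), suffix=".json")
    print(local_copy.read_text(encoding="utf-8"))
```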

IngestPipeline integration (Phase 2)

RML integrates with the existing trails.ingest pipeline. When rml_mapping= is set, the pipeline delegates to trails.rml instead of code extractors:

from trails.ingest import ingest_file

# Code extractor (unchanged — progressive enhancement)
doc_iri = ingest_file("paper.pdf", ctx)

# RML-driven ingestion for structured data
doc_iri = ingest_file(
    "transactions.csv",
    ctx,
    rml_mapping="mappings/transactions.ttl",
)

When using RML through the ingest pipeline:

  • The chunker step is skipped — RML produces typed triples directly, not flat text that needs splitting.
  • PROV-O provenance is emitted by the RML runner, not the ingest pipeline, to avoid double-counting. The ingest prov:Activity links to the RML activity via prov:wasInformedBy.
  • Content-hash dedup still applies at the file level.
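
File-level content-hash dedup can be illustrated as follows. This is a minimal sketch under assumptions: the pipeline's real hash algorithm and storage are not stated here, so SHA-256 and an in-memory set are stand-ins, and file_digest and should_ingest are hypothetical names.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Hash a file's bytes in chunks so large files are not loaded whole."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def should_ingest(path: str, seen: set[str]) -> bool:
    """Return True the first time a given file content is encountered."""
    digest = file_digest(path)
    if digest in seen:
        return False  # identical bytes were already ingested
    seen.add(digest)
    return True
```

The check depends only on the bytes, so a renamed copy of an already ingested file is skipped as well.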

Auto-generating mappings with trails rml generate (Phase 3)

Instead of hand-writing RML Turtle for every source, use generate_mapping to auto-generate a mapping from a source file's schema. The generator reads headers/keys/elements from CSV, JSON, or XML files, infers XSD types from sampled values, and produces a valid RML mapping ready for trails rml run.

CLI usage

# Generate from a CSV file (prints Turtle to stdout)
trails rml generate data/employees.csv

# Specify source type explicitly (auto-detected from extension by default)
trails rml generate data/products.json --source-type json

# Custom base IRI and node type name
trails rml generate data/orders.csv \
    --base-iri https://myapp.example/ \
    --node-type-name Order

# Write output to a file
trails rml generate data/employees.csv -o mappings/employees.ttl

Python API

from trails.rml.generate import generate_mapping

result = generate_mapping(
    "data/employees.csv",
    source_type="csv",
    base_iri="https://myapp.example/",
    node_type_name="Employee",
)

print(f"Detected columns: {result.columns}")
print(f"Inferred types: {result.inferred_types}")
print(f"Node type: {result.node_type_name}")
print()
print(result.turtle)

Type inference

The generator samples the first 100 rows and infers XSD types using regex heuristics:

| Pattern | Inferred type |
|---|---|
| All integers (42, -7) | xsd:integer |
| All floats (3.14, 1.0e5) | xsd:double |
| All dates (2026-04-17) | xsd:date |
| All datetimes (2026-04-17T14:30:00Z) | xsd:dateTime |
| All booleans (true, false, yes, no) | xsd:boolean |
| All HTTP(S) URLs | xsd:anyURI (rendered as rr:IRI term type) |
| Everything else | xsd:string |
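
The heuristics in the table can be sketched as an ordered list of regular expressions where a type wins only if every sampled value matches. The exact patterns the generator uses are not documented here, so the ones below are assumptions, and infer_xsd_type is a hypothetical helper.

```python
import re
from itertools import islice

# Order matters: integers are tried before doubles, datetimes before dates.
_PATTERNS = [
    ("xsd:integer", re.compile(r"[+-]?\d+")),
    ("xsd:double", re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")),
    ("xsd:dateTime", re.compile(
        r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?")),
    ("xsd:date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("xsd:boolean", re.compile(r"true|false|yes|no", re.IGNORECASE)),
    ("xsd:anyURI", re.compile(r"https?://\S+")),
]


def infer_xsd_type(values, sample_size: int = 100) -> str:
    """Infer one XSD type for a column from its first sample_size values."""
    sample = [v.strip() for v in islice(values, sample_size) if v.strip()]
    if not sample:
        return "xsd:string"  # nothing to learn from empty cells
    for xsd_type, pattern in _PATTERNS:
        if all(pattern.fullmatch(value) for value in sample):
            return xsd_type
    return "xsd:string"


print(infer_xsd_type(["42", "-7"]))
print(infer_xsd_type(["3.14", "1.0e5"]))
print(infer_xsd_type(["Alice", "Bob"]))
```

A column that mixes integers and decimals falls through to xsd:double in this sketch, because the double pattern also accepts plain integers.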

Example output

Given employees.csv with columns id, name, department, salary, the generated mapping looks like:

@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<#TriplesMap_Employee> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "employees.csv" ;
        rml:referenceFormulation ql:CSV
    ] ;
    rr:subjectMap [
        rr:template "https://myapp.example/Employee/{id}" ;
        rr:class <https://myapp.example/Employee>
    ] ;
    rr:predicateObjectMap [
        rr:predicate <https://myapp.example/id> ;
        rr:objectMap [
            rml:reference "id" ;
            rr:datatype xsd:integer
        ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate <https://myapp.example/name> ;
        rr:objectMap [ rml:reference "name" ]
    ] ;
    ...

The output is a starting point. Review the generated mapping, adjust types and predicates, then run it with trails rml run.
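
To show how a column list and inferred types turn into Turtle of the shape above, here is a string-templating sketch. It is not the generator's real code: render_mapping is a hypothetical helper, it only handles CSV sources, and it omits the rr:IRI term type used for xsd:anyURI columns.

```python
def render_mapping(source, base_iri, node_type, columns, types, key):
    """Render a CSV RML mapping as Turtle text from column metadata."""
    lines = [
        "@prefix rml: <http://semweb.mmlab.be/ns/rml#> .",
        "@prefix rr:  <http://www.w3.org/ns/r2rml#> .",
        "@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        f"<#TriplesMap_{node_type}> a rr:TriplesMap ;",
        "    rml:logicalSource [",
        f'        rml:source "{source}" ;',
        "        rml:referenceFormulation ql:CSV",
        "    ] ;",
        "    rr:subjectMap [",
        # Triple braces emit a literal {key} placeholder for rr:template.
        f'        rr:template "{base_iri}{node_type}/{{{key}}}" ;',
        f"        rr:class <{base_iri}{node_type}>",
        "    ]",
    ]
    for column in columns:
        xsd_type = types.get(column, "xsd:string")
        obj = f'rml:reference "{column}"'
        if xsd_type != "xsd:string":
            obj += f" ; rr:datatype {xsd_type}"
        lines[-1] += " ;"  # continue the predicate list of the TriplesMap
        lines += [
            "    rr:predicateObjectMap [",
            f"        rr:predicate <{base_iri}{column}> ;",
            f"        rr:objectMap [ {obj} ]",
            "    ]",
        ]
    lines[-1] += " ."  # terminate the TriplesMap statement
    return "\n".join(lines) + "\n"


turtle = render_mapping(
    "employees.csv", "https://myapp.example/", "Employee",
    ["id", "name", "department", "salary"],
    {"id": "xsd:integer", "salary": "xsd:integer"},
    key="id",
)
print(turtle)
```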

Anti-patterns

Hardcoding paths in mapping files. Use @rml_source or --source-override to decouple paths from mappings. Hardcoded paths break when moving between dev, CI, and production.

Skipping validation before execution. Run trails rml validate in CI to catch syntax errors and missing sources before they fail at runtime. Validation never executes the mapping, so it is cheap enough to run on every commit.

Using RML for unstructured text. RML is designed for structured and semi-structured data with predictable schemas. For PDFs, HTML, and plain text, use the code extractors in trails.ingest.

Reference

| Symbol | Description |
|---|---|
| run_mapping(ctx, mapping_path, **source_overrides) | Execute an RML mapping; returns MappingResult |
| MappingResult | .triples_added, .sources, .duration_ms, .activity_iri |
| validate_mapping(mapping_path) | Validate without executing; returns list[ValidationIssue] |
| ValidationIssue | .severity ("error" / "warning"), .message |
| rml_source(name, *, kind) | Decorator: register a function as an RML logical source. Kinds: "csv", "json", "xml", "rdbms", "http" |
| SourceEntry | .name, .kind, .resolver, .resolve() |
| get_source(name) | Look up a registered source by name |
| list_sources() | Return all registered sources, sorted by name |
| clear_sources() | Remove all registered sources (test teardown) |
| generate_mapping(source_path, *, source_type, base_iri, node_type_name) | Phase 3: auto-generate an RML mapping from a source file's schema |
| GeneratedMapping | .turtle, .source_type, .columns, .inferred_types, .node_type_name |

See also