RML Data Mapping¶

trails.rml turns structured data sources (CSV, JSON, XML) into typed RDF triples using declarative RML mapping files. Instead of writing a Python extractor for every source schema, you write a Turtle mapping that describes how rows and fields map to subjects, predicates, and objects. The full design lives in ADR-0024; the progressive-enhancement framing that keeps RML additive is ADR-0021.

When to use RML vs code extractors¶

Scenario	Recommended
Unstructured text (PDF, HTML, Markdown)	Code extractors via `trails.ingest`
Structured tabular data (CSV, TSV)	RML mapping
Semi-structured data (JSON, XML)	RML mapping
One-off import with custom logic	Code extractor
Many source schemas that change often	RML mappings (declarative, diffable)
RDBMS tables	RML mapping with `@rml_source(name, kind="rdbms")`

RML mappings are data, not code. They can be versioned, diffed, reviewed, and shared without touching Python. They are also RDF themselves, so they can live in the KG and be queried with SPARQL.

Installation¶

RML support requires Morph-KGC, an actively maintained W3C-test-suite-compliant RML processor:

pip install 'trails[rml]'

The import is lazy. If morph-kgc is not installed, trails.rml raises TrailsError with the exact install command — no silent fallback, no raw ImportError.
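
The lazy-import behaviour can be pictured with a small sketch. This is an illustration of the pattern only, not the actual trails.rml source: the `TrailsError` class below is a local stand-in, and `require_optional` and its parameters are hypothetical names.

```python
import importlib


class TrailsError(Exception):
    """Local stand-in for the framework's error type (assumption)."""


def require_optional(module_name: str, extra: str):
    """Import an optional dependency on first use.

    Hypothetical helper: a missing module becomes an error that
    carries the install command instead of a raw ImportError.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise TrailsError(
            f"{module_name} is not installed; run: pip install 'trails[{extra}]'"
        ) from exc
```

Calling the helper inside the function that needs the dependency, rather than at module import time, is what keeps the base package importable without the extra installed.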

Quickstart¶

One CSV file, one mapping, loaded into the KG with full provenance:

from trails.testing import fresh_context
from trails.rml import run_mapping

ctx = fresh_context()
result = run_mapping(ctx, "mappings/sales.ttl", sales="/data/sales.csv")
print(f"Loaded {result.triples_added} triples in {result.duration_ms:.0f}ms")
print(f"Activity: {result.activity_iri}")

Writing your first RML mapping¶

Suppose you have a CSV file employees.csv:

id,name,department,salary
1,Alice,Engineering,95000
2,Bob,Marketing,82000
3,Carol,Engineering,98000

Write a Turtle mapping file mappings/employees.ttl:

@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://myapp.example/> .

<#EmployeeMapping> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "employees.csv" ;
        rml:referenceFormulation ql:CSV
    ] ;
    rr:subjectMap [
        rr:template "https://myapp.example/Employee/{id}" ;
        rr:class ex:Employee
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rml:reference "name" ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:department ;
        rr:objectMap [ rml:reference "department" ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:salary ;
        rr:objectMap [
            rml:reference "salary" ;
            rr:datatype xsd:integer
        ]
    ] .

Execute it:

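from trails.testing import fresh_context
from trails.rml import run_mapping

ctx = fresh_context()  # same test context as in the Quickstart
result = run_mapping(ctx, "mappings/employees.ttl")
# Loads 3 employees x 4 triples each (type + 3 properties) = 12 triples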

The generated triples look like:

<https://myapp.example/Employee/1> a <https://myapp.example/Employee> ;
    <https://myapp.example/name> "Alice" ;
    <https://myapp.example/department> "Engineering" ;
    <https://myapp.example/salary> "95000"^^xsd:integer .
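
To see where the 12 triples come from, the mapping rules can be applied by hand in plain Python. This is a conceptual sketch using only the standard library; it is not how Morph-KGC is implemented, and `map_employees` and `literal` are hypothetical helper names.

```python
import csv
import io

CSV_TEXT = """id,name,department,salary
1,Alice,Engineering,95000
2,Bob,Marketing,82000
3,Carol,Engineering,98000
"""

BASE = "https://myapp.example/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def literal(value: str, datatype: str = "") -> str:
    """Render a literal; no escaping, so only suitable for this sample data."""
    text = f'"{value}"'
    return f"{text}^^<{datatype}>" if datatype else text


def map_employees(text: str) -> list:
    """Apply the EmployeeMapping rules to each CSV row."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        # rr:template: substitute the {id} placeholder with the row value
        subject = f"<{BASE}Employee/{row['id']}>"
        # rr:class: one type triple per row
        triples.append((subject, "a", f"<{BASE}Employee>"))
        # rml:reference without rr:datatype: plain literal
        triples.append((subject, f"<{BASE}name>", literal(row["name"])))
        triples.append((subject, f"<{BASE}department>", literal(row["department"])))
        # rml:reference with rr:datatype xsd:integer: typed literal
        triples.append((subject, f"<{BASE}salary>", literal(row["salary"], XSD_INTEGER)))
    return triples


triples = map_employees(CSV_TEXT)
print(len(triples))
```

Each row contributes one type triple from the subject map plus one triple per predicate-object map, which is why the count scales as rows times (1 + number of predicate-object maps).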

CLI commands¶

`trails rml run`¶

Execute an RML mapping and load results into the local KG:

# Basic execution
trails rml run mappings/employees.ttl

# Override source paths at invocation time
trails rml run mappings/sales.ttl --source-override sales=/tmp/staged/sales.csv

# Write to a specific named graph
trails rml run mappings/employees.ttl --graph https://myapp.example/imports/2026-04

--source-override rebinds logical source paths without editing the mapping file — useful for CI pipelines that stage data in temp directories.

`trails rml validate`¶

Check a mapping for structural errors without executing it:

trails rml validate mappings/employees.ttl

The validator checks:

Check	Severity	Description
Turtle syntax	Error	The file must parse as valid Turtle
Subject maps	Error	Every `rr:TriplesMap` must have `rr:subjectMap` or `rr:subject`
Predicate-object maps	Error	Every `rr:TriplesMap` must have at least one `rr:predicateObjectMap`
Logical source reachability	Warning	Local file paths referenced by `rml:source` are checked for existence

An empty issue list means the mapping passed all checks:

from trails.rml import validate_mapping

issues = validate_mapping("mappings/employees.ttl")
if issues:
    for issue in issues:
        print(f"[{issue.severity.upper()}] {issue.message}")
else:
    print("Mapping is valid")
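
In a CI job, the issue list usually has to become an exit code. The sketch below shows one way to do that. The `Issue` dataclass is a stand-in that mirrors the two documented `ValidationIssue` fields, and the `strict` switch is an assumption added for illustration, not a documented option.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Stand-in with the two documented ValidationIssue fields."""
    severity: str  # "error" or "warning"
    message: str


def exit_code(issues: list, strict: bool = False) -> int:
    """Return 1 if the issues should fail the build, else 0.

    Errors always fail; warnings fail only when strict is True.
    """
    failing = {"error", "warning"} if strict else {"error"}
    return 1 if any(issue.severity in failing for issue in issues) else 0
```

With the real API, the list returned by `validate_mapping` could be passed to a function like this and the result handed to `sys.exit`.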

`trails rml sources`¶

List all registered RML sources (see @rml_source below):

trails rml sources

Source registry with `@rml_source`¶

The @rml_source decorator registers a reusable data source that RML mappings can reference by logical name. This decouples mapping logic from deployment paths — the same mapping runs in dev (local CSV) and prod (S3 export or API endpoint) by rebinding the source.

from trails.rml import rml_source

@rml_source("sales_csv", kind="csv")
def sales_data():
    return "/data/exports/sales_2026.csv"

@rml_source("product_feed", kind="json")
def product_api():
    return "https://api.example.com/products.json"

Supported kinds: "csv", "json", "xml" (Phase 1), "rdbms", "http" (Phase 2).

Inspect the registry programmatically:

from trails.rml import get_source, list_sources, clear_sources

# Look up a single source
entry = get_source("sales_csv")
print(entry.resolve())  # "/data/exports/sales_2026.csv"

# List all registered sources
for src in list_sources():
    print(f"{src.name} ({src.kind}): {src.resolve()}")

# Clear the registry (test teardown)
clear_sources()

The runner resolves rml:logicalSource references against this registry so mappings can use stable names. Sources registered via @rml_source also appear in trails rml validate output (reachability check).

Integration with IngestPipeline¶

See the IngestPipeline integration (Phase 2) section below for how RML plugs into ingest_file() via the rml_mapping= parameter.

PROV-O provenance¶

Every run_mapping execution emits PROV-O triples in the framework provenance graph (https://trails.dev/ns/prov/), consistent with ADR-0009:

<https://trails.dev/activity/rml/019…> a prov:Activity ;
    prov:startedAtTime "2026-04-17T14:30:00Z"^^xsd:dateTime ;
    prov:endedAtTime   "2026-04-17T14:30:02Z"^^xsd:dateTime ;
    prov:used          <file:///project/mappings/sales.ttl> ;
    prov:used          <file:///data/sales.csv> ;
    trails:activityKind "trails.rml.run_mapping" ;
    trails:tripleCount  4287 .

The mapping file and each source file are asserted as prov:Entity nodes so downstream trace tooling (trails trace) can answer "which mapping produced this triple?" without scraping activity IRIs.

The MappingResult object returned by run_mapping carries the activity_iri for programmatic access:

result = run_mapping(ctx, "mappings/sales.ttl")
print(result.activity_iri)
# "https://trails.dev/activity/rml/019…"
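
The shape of the activity record can be reproduced with a short formatter. This is a sketch of the serialisation only, under stated assumptions: `activity_turtle` and `xsd_datetime` are hypothetical helpers, the activity IRI in the usage line is a made-up placeholder, and the real runner may build these triples differently.

```python
from datetime import datetime, timezone
from urllib.parse import quote


def xsd_datetime(moment: datetime) -> str:
    """Format an aware datetime as a UTC xsd:dateTime literal."""
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'"{stamp}"^^xsd:dateTime'


def activity_turtle(activity_iri, started, ended, used_paths, triple_count):
    """Render one prov:Activity in the layout shown in this section."""
    lines = [f"<{activity_iri}> a prov:Activity ;"]
    lines.append(f"    prov:startedAtTime {xsd_datetime(started)} ;")
    lines.append(f"    prov:endedAtTime {xsd_datetime(ended)} ;")
    for path in used_paths:
        # Absolute POSIX paths become file:// IRIs
        lines.append(f"    prov:used <file://{quote(path)}> ;")
    lines.append('    trails:activityKind "trails.rml.run_mapping" ;')
    lines.append(f"    trails:tripleCount {triple_count} .")
    return "\n".join(lines)


print(activity_turtle(
    "https://trails.dev/activity/rml/example",  # placeholder IRI
    datetime(2026, 4, 17, 14, 30, 0, tzinfo=timezone.utc),
    datetime(2026, 4, 17, 14, 30, 2, tzinfo=timezone.utc),
    ["/project/mappings/sales.ttl", "/data/sales.csv"],
    4287,
))
```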

RDBMS sources (Phase 2)¶

Register a database as an RML source with kind="rdbms". The resolver returns a SQLAlchemy connection string:

from trails.rml import rml_source

@rml_source("warehouse", kind="rdbms")
def warehouse_db():
    return "postgresql://user:pass@host/db"

Write a mapping that references the database tables. The RML processor issues SQL queries to extract rows and maps them to triples using the same rr:subjectMap / rr:predicateObjectMap syntax as file-based mappings:

@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <https://myapp.example/> .

<#OrderMapping> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "warehouse" ;
        rr:sqlVersion rr:SQL2008 ;
        rr:tableName "orders"
    ] ;
    rr:subjectMap [
        rr:template "https://myapp.example/Order/{order_id}" ;
        rr:class ex:Order
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:customer ;
        rr:objectMap [ rml:reference "customer_name" ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:total ;
        rr:objectMap [
            rml:reference "total_amount" ;
            rr:datatype <http://www.w3.org/2001/XMLSchema#decimal>
        ]
    ] .
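
Conceptually, the processor runs a query against the table and applies the subject and predicate-object maps to each result row. The sketch below imitates that with an in-memory SQLite database from the standard library. The table contents are invented sample data, `map_orders` is a hypothetical helper, and the real processor connects through the registered connection string rather than sqlite3.

```python
import sqlite3

BASE = "https://myapp.example/"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def map_orders(conn: sqlite3.Connection) -> list:
    """Apply the OrderMapping rules to every row of the orders table."""
    triples = []
    rows = conn.execute(
        "SELECT order_id, customer_name, total_amount FROM orders ORDER BY order_id"
    )
    for order_id, customer, total in rows:
        subject = f"<{BASE}Order/{order_id}>"
        triples.append((subject, "a", f"<{BASE}Order>"))
        triples.append((subject, f"<{BASE}customer>", f'"{customer}"'))
        triples.append((subject, f"<{BASE}total>", f'"{total}"^^<{XSD_DECIMAL}>'))
    return triples


# Invented sample data; amounts are stored as text to keep the decimal exact.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_name TEXT, total_amount TEXT)"
)
conn.execute("INSERT INTO orders VALUES (1, 'Alice', '19.99')")

for triple in map_orders(conn):
    print(triple)
```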

HTTP sources (Phase 2)¶

Register a URL as an RML source with kind="http". The runner fetches the content before executing the mapping:

from trails.rml import rml_source

@rml_source("product_api", kind="http")
def product_feed():
    return "https://api.example.com/products.json"

The source URL is fetched to a temporary file, then the mapping runs against it. This works with JSON, XML, or CSV responses. Use --source-override in CI to point at a cached local copy instead of hitting the API every run:

trails rml run mappings/products.ttl \
    --source-override product_api=/tmp/cached/products.json
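
The fetch-then-map step can be approximated with the standard library. This is a sketch of the general technique, not the runner's code: `fetch_to_tempfile` and its parameters are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_to_tempfile(url: str, suffix: str = "") -> Path:
    """Download url into a new temporary file and return its path.

    The caller is responsible for deleting the file afterwards.
    """
    with urllib.request.urlopen(url, timeout=30) as response:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            # Stream the body to disk instead of holding it in memory
            shutil.copyfileobj(response, tmp)
            return Path(tmp.name)
```

Keeping the original extension via `suffix` helps when a later step picks the parser from the file name.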

IngestPipeline integration (Phase 2)¶

RML integrates with the existing trails.ingest pipeline. When rml_mapping= is set, the pipeline delegates to trails.rml instead of code extractors:

from trails.ingest import ingest_file

# Code extractor (unchanged — progressive enhancement)
doc_iri = ingest_file("paper.pdf", ctx)

# RML-driven ingestion for structured data
doc_iri = ingest_file(
    "transactions.csv",
    ctx,
    rml_mapping="mappings/transactions.ttl",
)

When using RML through the ingest pipeline:

The chunker step is skipped — RML produces typed triples directly, not flat text that needs splitting.
PROV-O provenance is emitted by the RML runner, not the ingest pipeline, to avoid double-counting. The ingest prov:Activity links to the RML activity via prov:wasInformedBy.
Content-hash dedup still applies at the file level.
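
File-level content-hash dedup can be sketched as follows. The choice of SHA-256 and the in-memory `seen` set are assumptions for illustration; the pipeline's actual hash function and storage are not specified here, and `file_digest` and `should_ingest` are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_ingest(path: Path, seen: set) -> bool:
    """Return False when identical content was already ingested."""
    key = file_digest(path)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Because the key is derived from the bytes rather than the file name, a renamed copy of the same CSV is skipped, while an edited file with the same name is ingested again.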

Auto-generating mappings with `trails rml generate` (Phase 3)¶

Instead of hand-writing RML Turtle for every source, use generate_mapping to auto-generate a mapping from a source file's schema. The generator reads headers/keys/elements from CSV, JSON, or XML files, infers XSD types from sampled values, and produces a valid RML mapping ready for trails rml run.

CLI usage¶

# Generate from a CSV file (prints Turtle to stdout)
trails rml generate data/employees.csv

# Specify source type explicitly (auto-detected from extension by default)
trails rml generate data/products.json --source-type json

# Custom base IRI and node type name
trails rml generate data/orders.csv \
    --base-iri https://myapp.example/ \
    --node-type-name Order

# Write output to a file
trails rml generate data/employees.csv -o mappings/employees.ttl

Python API¶

from trails.rml.generate import generate_mapping

result = generate_mapping(
    "data/employees.csv",
    source_type="csv",
    base_iri="https://myapp.example/",
    node_type_name="Employee",
)

print(f"Detected columns: {result.columns}")
print(f"Inferred types: {result.inferred_types}")
print(f"Node type: {result.node_type_name}")
print()
print(result.turtle)

Type inference¶

The generator samples the first 100 rows and infers XSD types using regex heuristics:

Pattern	Inferred type
All integers (`42`, `-7`)	`xsd:integer`
All floats (`3.14`, `1.0e5`)	`xsd:double`
All dates (`2026-04-17`)	`xsd:date`
All datetimes (`2026-04-17T14:30:00Z`)	`xsd:dateTime`
All booleans (`true`, `false`, `yes`, `no`)	`xsd:boolean`
All HTTP(S) URLs	`xsd:anyURI` (rendered as `rr:IRI` term type)
Everything else	`xsd:string`
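
The table can be read as an ordered list of patterns where the first one that matches every sampled value wins. The sketch below implements that reading. The regular expressions, the handling of empty cells, and the `infer_type` name are assumptions for illustration and may differ from the generator's actual heuristics.

```python
import re

# Order matters: integers are checked before floats, datetimes before dates.
_PATTERNS = [
    ("xsd:integer", re.compile(r"[+-]?\d+")),
    ("xsd:double", re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")),
    ("xsd:dateTime", re.compile(
        r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?")),
    ("xsd:date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("xsd:boolean", re.compile(r"true|false|yes|no", re.IGNORECASE)),
    ("xsd:anyURI", re.compile(r"https?://\S+")),
]


def infer_type(values, sample_size: int = 100) -> str:
    """Return the first XSD type whose pattern matches every sampled value."""
    # Empty cells are ignored (assumption); an all-empty column is a string.
    sample = [v.strip() for v in values[:sample_size] if v.strip()]
    if not sample:
        return "xsd:string"
    for xsd_type, pattern in _PATTERNS:
        if all(pattern.fullmatch(v) for v in sample):
            return xsd_type
    return "xsd:string"
```

A column is only promoted to a narrower type when every sampled value matches, so a single stray value such as `n/a` makes the whole column fall back to `xsd:string`. That is one reason to review inferred types before running the mapping.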

Example output¶

Given employees.csv with columns id, name, department, salary, the generated mapping looks like:

@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<#TriplesMap_Employee> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "employees.csv" ;
        rml:referenceFormulation ql:CSV
    ] ;
    rr:subjectMap [
        rr:template "https://myapp.example/Employee/{id}" ;
        rr:class <https://myapp.example/Employee>
    ] ;
    rr:predicateObjectMap [
        rr:predicate <https://myapp.example/id> ;
        rr:objectMap [
            rml:reference "id" ;
            rr:datatype xsd:integer
        ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate <https://myapp.example/name> ;
        rr:objectMap [ rml:reference "name" ]
    ] ;
    ...

The output is a starting point. Review the generated mapping, adjust types and predicates, then run it with trails rml run.

Anti-patterns¶

Hardcoding paths in mapping files. Use @rml_source or --source-override to decouple paths from mappings. Hardcoded paths break when moving between dev, CI, and production.

Skipping validation before execution. Run trails rml validate in CI to catch syntax errors and missing sources before they fail at runtime. Validation inspects the mapping without executing it, so the extra step adds little to a CI run.

Using RML for unstructured text. RML is designed for structured and semi-structured data with predictable schemas. For PDFs, HTML, and plain text, use the code extractors in trails.ingest.

Reference¶

Symbol	Description
`run_mapping(ctx, mapping_path, **source_overrides)`	Execute an RML mapping; returns `MappingResult`
`MappingResult`	`.triples_added`, `.sources`, `.duration_ms`, `.activity_iri`
`validate_mapping(mapping_path)`	Validate without executing; returns `list[ValidationIssue]`
`ValidationIssue`	`.severity` (`"error"` / `"warning"`), `.message`
`rml_source(name, *, kind)`	Decorator: register a function as an RML logical source. Kinds: `"csv"`, `"json"`, `"xml"`, `"rdbms"`, `"http"`
`SourceEntry`	`.name`, `.kind`, `.resolver`, `.resolve()`
`get_source(name)`	Look up a registered source by name
`list_sources()`	Return all registered sources, sorted by name
`clear_sources()`	Remove all registered sources (test teardown)
`generate_mapping(source_path, *, source_type, base_iri, node_type_name)`	Phase 3: auto-generate an RML mapping from a source file's schema
`GeneratedMapping`	`.turtle`, `.source_type`, `.columns`, `.inferred_types`, `.node_type_name`