Chapter 8 — The Auto-Ontology Paradigm¶

Traditional knowledge-graph engineering starts with an ontology workshop: domain experts, whiteboard sessions, weeks of modelling before a single triple reaches the store. Trails does the opposite. You load data first, and the framework discovers your schema.

This chapter walks through the full auto-ontology workflow -- from a raw CSV to a production-grade ontology -- and explains why this is a genuine paradigm shift.

Why ontology-first fails¶

The classic KG workflow is:

design ontology → build store → validate data → ship

This creates a chicken-and-egg problem:

You cannot design a good ontology without data. Domain experts know their field but not how their data actually looks in practice. Fields that seem mandatory turn out to be empty in, say, 40% of records. Relationships that seem obvious don't exist in the data.
The ontology becomes a bottleneck. Every new data source requires an ontology committee meeting. A field that doesn't fit the schema gets dropped or force-fit. The ontology calcifies.
The investment is front-loaded. You spend weeks modelling before writing a single line of application code. If the domain shifts (it will), the ontology lags.

The Trails approach: data-first, schema-later¶

load data → infer types → refine schema → ship

Trails provides four mechanisms that support this workflow, each adopted progressively:

Phase	Mechanism	Cost	What it does
1	`trails onto infer`	Free (SPARQL)	Statistical schema inference from existing data
2	`trails onto generate`	LLM tokens (~$0.001)	Describe your domain, get `@node_type` code
3	`trails onto refine`	Free (log analysis)	Usage-driven refinement suggestions
4	Admin dashboard	Free	Browser UI for continuous evolution

No phase depends on the previous one. Use any combination. The output is always plain Python @node_type code -- no magic, no runtime dependency on the generator.

Step 1: Load raw data (labels and edges, no types)¶

Start with data. You don't need @node_type declarations to write to the graph. Use ctx.kg.node() for label-first writes, or use RML mappings for structured sources.

Option A: Label-first writes¶

from trails import capability

@capability
def import_employees(ctx) -> dict:
    """Load employee data with no schema at all."""
    employees = [
        {"name": "Alice", "department": "Engineering", "salary": 95000},
        {"name": "Bob", "department": "Marketing", "salary": 82000},
        {"name": "Carol", "department": "Engineering", "salary": 91000,
         "manager": "Alice"},
    ]
    count = 0
    for emp in employees:
        ctx.kg.node(labels=["Employee"], properties=emp)
        count += 1
    return {"imported": count}

No type declarations. No SHACL shapes. Just labels and properties.

Option B: RML mapping from CSV¶

Given employees.csv:

id,name,department,salary,manager_id
1,Alice,Engineering,95000,
2,Bob,Marketing,82000,
3,Carol,Engineering,91000,1

And a mapping file mappings/employees.ttl:

@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql:  <http://semweb.mmlab.be/ns/ql#>.
@prefix rr:  <http://www.w3.org/ns/r2rml#>.
@prefix ex:  <https://myapp.example/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

<#EmployeeMapping>
    rml:logicalSource [
        rml:source "employees.csv";
        rml:referenceFormulation ql:CSV
    ];
    rr:subjectMap [
        rr:template "https://myapp.example/Employee/{id}";
        rr:class ex:Employee
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rml:reference "name" ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:department;
        rr:objectMap [ rml:reference "department" ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:salary;
        rr:objectMap [ rml:reference "salary"; rr:datatype xsd:integer ]
    ].

Load it:

from trails.rml import run_mapping
from trails.testing import fresh_context

ctx = fresh_context()
result = run_mapping(ctx, "mappings/employees.ttl",
                     employees="employees.csv")
print(f"Loaded {result.triples_added} triples")

Either way, data is in the graph. No ontology needed yet.
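To make the effect of the Option B mapping concrete, here is a minimal standard-library sketch of the triples it describes. It is not how `run_mapping` is implemented: the `rows_to_triples` helper is hypothetical, and it assumes that empty CSV cells produce no triple.

```python
import csv
import io

CSV_TEXT = """id,name,department,salary,manager_id
1,Alice,Engineering,95000,
2,Bob,Marketing,82000,
3,Carol,Engineering,91000,1
"""

BASE = "https://myapp.example/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Columns the mapping file references; manager_id is not mapped there.
MAPPED_COLUMNS = ("name", "department", "salary")


def rows_to_triples(csv_text):
    """Hypothetical helper: emit (subject, predicate, object) tuples per row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"{BASE}Employee/{row['id']}"                  # rr:template
        triples.append((subject, RDF_TYPE, f"{BASE}Employee"))   # rr:class
        for column in MAPPED_COLUMNS:
            value = row[column]
            if value == "":
                continue  # assumption: empty cells yield no triple
            # salary carries rr:datatype xsd:integer in the mapping
            obj = int(value) if column == "salary" else value
            triples.append((subject, f"{BASE}{column}", obj))
    return triples


triples = rows_to_triples(CSV_TEXT)
print(len(triples))  # 3 rows x (rdf:type + 3 mapped columns) = 12
```

Each row contributes one `rdf:type` triple from `rr:class` plus one triple per mapped column, which is why the type is already visible to the inference step without any `@node_type` declaration.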

Step 2: `trails onto infer` -- the framework discovers your schema¶

Now that data exists, ask the framework to analyse it:

trails onto infer -o models/inferred.py

The inference engine runs four deterministic passes:

Type discovery. Finds all distinct rdf:type values and their instance counts.
Predicate clustering. For each candidate type, collects every predicate used by its instances. When no explicit rdf:type exists, entities are clustered by Jaccard similarity (>= 80% shared predicates).
Type inference. Examines values: XSD annotations are respected first; heuristics apply for un-annotated values (all-integer -> int, all-IRI -> reference, etc.).
Cardinality inference. Counts values per subject per predicate to decide whether a field is required (present on every instance) and whether it is a list[T] (more than one value per subject).
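The predicate-clustering pass can be illustrated with a small sketch. This is one plausible reading of "Jaccard similarity (>= 80% shared predicates)", not the Trails implementation: the greedy single-pass strategy and the names `jaccard` and `cluster_by_predicates` are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_predicates(entities, threshold=0.8):
    """Greedy single-pass clustering (an assumption, not the Trails algorithm).

    `entities` maps a subject IRI to the set of predicates it uses. Each
    entity joins the first cluster whose representative predicate set is
    at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative_predicates, [subjects])
    for subject, predicates in entities.items():
        for representative, members in clusters:
            if jaccard(predicates, representative) >= threshold:
                members.append(subject)
                break
        else:
            clusters.append((set(predicates), [subject]))
    return [members for _, members in clusters]


entities = {
    "ex:e1": {"name", "department", "salary", "hire_date", "manager"},
    "ex:e2": {"name", "department", "salary", "hire_date"},
    "ex:p1": {"title", "budget"},
    "ex:p2": {"title", "budget"},
}
clusters = cluster_by_predicates(entities)
print(clusters)  # [['ex:e1', 'ex:e2'], ['ex:p1', 'ex:p2']]
```

The two employee-like entities share four of five predicates (similarity 0.8), so they land in the same candidate type even though one of them has no manager.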

Zero LLM cost. Pure SPARQL statistical analysis.
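The type and cardinality passes boil down to simple counting. The sketch below works on an in-memory list of tuples as a stand-in for the SPARQL queries the engine runs; the function names and the exact heuristics are assumptions, chosen to match the description above.

```python
import re
from collections import defaultdict

# (subject, predicate, value) triples standing in for the store's contents.
TRIPLES = [
    ("e1", "name", "Alice"), ("e1", "salary", "95000"),
    ("e1", "skill", "python"), ("e1", "skill", "sparql"),
    ("e2", "name", "Bob"), ("e2", "salary", "82000"),
    ("e3", "name", "Carol"), ("e3", "salary", "91000"),
    ("e3", "manager", "https://myapp.example/Employee/1"),
]


def infer_python_type(values):
    """Heuristic for un-annotated values: all-integer -> int, all-IRI -> reference."""
    if all(re.fullmatch(r"-?\d+", v) for v in values):
        return "int"
    if all(v.startswith(("http://", "https://")) for v in values):
        return "reference"
    return "str"


def infer_fields(triples):
    """Return {predicate: (type, required, is_list)} for one candidate type."""
    per_predicate = defaultdict(lambda: defaultdict(list))
    subjects = set()
    for s, p, v in triples:
        subjects.add(s)
        per_predicate[p][s].append(v)
    fields = {}
    for p, by_subject in per_predicate.items():
        values = [v for vs in by_subject.values() for v in vs]
        required = len(by_subject) == len(subjects)  # every subject has it
        is_list = any(len(vs) > 1 for vs in by_subject.values())
        fields[p] = (infer_python_type(values), required, is_list)
    return fields


fields = infer_fields(TRIPLES)
for name, spec in fields.items():
    print(name, spec)
```

Here `skill` comes out as an optional list because only one subject has it and that subject has two values, while `manager` is flagged as a reference because its only value is an IRI.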

Python API¶

from trails.onto_infer import infer_schema, generate_code

schema = infer_schema(
    ctx.kg._store,
    trace_id="employee-analysis",
    min_instances=3,
    confidence=0.8,
)

print(f"Analyzed {schema.total_triples_analyzed} triples")
print(f"Found {len(schema.candidates)} candidate types")

for candidate in schema.candidates:
    print(f"\n{candidate.name} ({candidate.instance_count} instances, "
          f"confidence: {candidate.confidence:.2f})")
    for prop in candidate.properties:
        marker = "required" if prop.required else "optional"
        print(f"  {prop.name}: {prop.python_type} ({marker}, "
              f"confidence: {prop.confidence:.2f})")

# Generate the Python module
code = generate_code(schema)
print(code)

Reading the output¶

The generated file looks like hand-written code, with confidence annotations:

"""Auto-generated @node_type declarations.

Source: trails onto infer (deterministic)
"""
from __future__ import annotations

from trails.orm import node_type

# 47 instance(s), confidence: 0.94
# source rdf:type: https://myapp.example/Employee
@node_type("Employee", fields={
    "name": str,         # confidence: 1.00 | e.g. 'Alice', 'Bob', 'Carol'
    "department": str,   # confidence: 1.00 | e.g. 'Engineering', 'Marketing'
    "salary": int,       # confidence: 0.96 | e.g. '95000', '82000'
    "manager": str,      # TODO: verify | confidence: 0.72 | e.g. 'https://...'
})
class Employee:
    pass

The markers tell you what to trust:

Marker	Meaning
`confidence: 1.00`	Every instance has this field -- high trust
`confidence: 0.72`	Only 72% of instances have this field -- review it
`TODO: verify`	Confidence below 0.8 -- the field may be noise
Instance count	More instances = more reliable inference
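As a rough illustration of how these markers could be derived, the sketch below treats a field's confidence as the share of instances that carry it and adds `TODO: verify` below 0.8. Both helper names are hypothetical, and the real generator may blend in other signals (for example type agreement).

```python
TODO_THRESHOLD = 0.8  # from the table above: below 0.8 gets "TODO: verify"


def presence_confidence(instances, field):
    """Fraction of instances that carry a non-missing value for `field`."""
    if not instances:
        return 0.0
    present = sum(1 for inst in instances if inst.get(field) is not None)
    return present / len(instances)


def annotate(field, python_type, confidence):
    """Render one line of a fields={...} dict with its trust marker."""
    todo = "TODO: verify | " if confidence < TODO_THRESHOLD else ""
    return f'    "{field}": {python_type},  # {todo}confidence: {confidence:.2f}'


instances = [
    {"name": "Alice", "manager": None},
    {"name": "Bob", "manager": "https://myapp.example/Employee/1"},
    {"name": "Carol", "manager": "https://myapp.example/Employee/1"},
    {"name": "Dave", "manager": "https://myapp.example/Employee/2"},
]
lines = [
    annotate(f, "str", presence_confidence(instances, f))
    for f in ("name", "manager")
]
print("\n".join(lines))
```

With three of four instances carrying a manager, the field scores 0.75 and is flagged for review, while `name` scores 1.00 and is left unmarked.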

Step 3: Review and refine the suggestions¶

Open models/inferred.py and apply your domain knowledge:

# Before (generated):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": str,  # TODO: verify | confidence: 0.72
})
class Employee:
    pass

# After (reviewed):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": "Employee | None",  # optional self-reference
})
class Employee: ...

What to look for:

Low-confidence fields (TODO: verify): decide if they're real or noise. Remove noise, keep real fields as optional.
String-typed references: the manager field pointing to an IRI is probably a reference to another Employee. Change the type.
Missing optional markers: fields present in 72% of instances are probably optional -- add | None.
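The checklist above can be turned into a small review aid. This is only a sketch: `review_hints` and `looks_like_iri` are hypothetical helpers, not part of Trails, and the 0.8 cut-off is taken from the marker table.

```python
from urllib.parse import urlparse


def looks_like_iri(value):
    """True when a string parses as an absolute http(s) IRI."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def review_hints(field, declared_type, samples, confidence):
    """Return review hints mirroring the checklist above (hypothetical helper)."""
    hints = []
    if confidence < 0.8:
        hints.append(f"{field}: low confidence, decide whether it is real or noise")
    if declared_type is str and samples and all(looks_like_iri(s) for s in samples):
        hints.append(f"{field}: str values are IRIs, probably a reference")
    if confidence < 1.0:
        hints.append(f"{field}: not present everywhere, consider '| None'")
    return hints


hints = review_hints(
    "manager", str, ["https://myapp.example/Employee/1"], confidence=0.72
)
for hint in hints:
    print(hint)
```

For the `manager` field in the example all three hints fire, which matches the manual edit shown above: change the type to a reference and make it optional.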

The generated code is now your code. Commit it. No runtime dependency on the generator.

Step 4: `trails onto generate` -- describe your domain, get types¶

When you need types for a domain that has no data yet, describe it in plain text:

trails onto generate --from domain.txt -o models/generated.py

Where domain.txt contains:

Clinical trial management. Patients enroll at Sites, which belong to
Studies. Each Study has multiple Arms. Adverse Events link to Patients
and Studies.

Python API¶

from trails.onto_generate import generate_schema, estimate_cost

# Always estimate cost first
est = estimate_cost(
    "Clinical trial management with patients, studies, sites, "
    "and adverse events.",
    model="haiku",
)
print(f"Estimated cost: ${est.estimated_cost_usd:.6f}")

# Generate
result = generate_schema(
    "Clinical trial management with patients, studies, sites, "
    "and adverse events. Each study has multiple sites. Patients "
    "enroll at a site. Adverse events link to patients and studies.",
    model="haiku",
    max_types=10,
)

print(f"Generated {len(result.types)} types: {result.types}")
print(f"Actual cost: ${result.cost_usd:.6f}")
print(result.code)

Model selection¶

Model	Use case	Cost per call
`haiku` (default)	Structural scaffolding, simple domains	~$0.001
`sonnet`	Domain-aware naming, nuanced relationships	~$0.01
`opus`	Complex multi-domain ontologies	~$0.10

The default is haiku -- cheapest adequate model.
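For quick budgeting, the per-call figures in the table can be combined with a planned number of calls. The sketch below is plain arithmetic over those approximate figures; it assumes every call to a given model costs about the same, which real token counts will not honour exactly. Use `estimate_cost` for an actual estimate.

```python
# Approximate per-call costs copied from the table above (USD).
APPROX_COST_PER_CALL = {"haiku": 0.001, "sonnet": 0.01, "opus": 0.10}


def estimate_budget(calls_by_model):
    """Rough total for a planned number of calls per model."""
    return sum(
        APPROX_COST_PER_CALL[model] * calls
        for model, calls in calls_by_model.items()
    )


# One sonnet pass for naming, then ten cheap haiku calls.
total = estimate_budget({"sonnet": 1, "haiku": 10})
print(f"~${total:.3f}")
```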

Iterative refinement¶

Refine existing code with natural-language feedback:

from trails.onto_generate import refine_schema

refined = refine_schema(
    code=result.code,
    feedback="Add a Randomization type linking Patient to StudyArm. "
             "Make patient.email optional.",
    model="haiku",
)
print(refined.code)

Step 5: `trails onto refine` -- the framework learns from usage¶

Once your app runs in production, the UsageCollector records runtime read and write patterns and the SchemaAnalyzer turns them into suggested improvements. Zero LLM cost -- pure log analysis.

Set up the collector¶

from trails.onto_refine import UsageCollector, SchemaAnalyzer

# Start collecting usage data (register as observability hook)
collector = UsageCollector()
collector.start()

# ... your app runs, writes data, queries data ...

Analyse and get suggestions¶

analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(
    min_samples=10,
    null_threshold=0.90,
    violation_threshold=0.10,
    co_query_threshold=0.80,
)

for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason} "
          f"(confidence: {s.confidence:.2f}, samples: {s.sample_count})")

Suggestion types¶

Kind	Trigger	Example
`make_optional`	Field null in >90% of writes	"'nickname' null in 95% of 200 writes"
`remove_field`	Written but never queried	"'internal_code' written 150x, never queried"
`relax_constraint`	SHACL violated >10% of writes	"'maxLength' on 'name' violated in 12%"
`add_index`	Fields co-queried >80%	"[name, department] co-queried in 85% of cases"
`add_relationship`	Types always co-occur in queries	"[Patient, Study] co-occur in 50 queries"
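Two of these triggers are easy to express as threshold checks over per-field counters. The sketch below is an assumption about the shape of the analysis, not the `SchemaAnalyzer` code: `FieldStats` and `suggest` are hypothetical names, and only `make_optional` and `remove_field` are covered.

```python
from dataclasses import dataclass


@dataclass
class FieldStats:
    """Hypothetical per-field counters a usage collector might keep."""
    node_type: str
    field: str
    writes: int = 0
    null_writes: int = 0
    queries: int = 0


def suggest(stats, min_samples=10, null_threshold=0.90):
    """Apply the make_optional and remove_field triggers from the table."""
    suggestions = []
    for s in stats:
        if s.writes < min_samples:
            continue  # too little evidence to suggest anything
        null_ratio = s.null_writes / s.writes
        if null_ratio > null_threshold:
            suggestions.append(("make_optional", s.node_type, s.field))
        if s.queries == 0:
            suggestions.append(("remove_field", s.node_type, s.field))
    return suggestions


stats = [
    FieldStats("Employee", "nickname", writes=200, null_writes=190, queries=40),
    FieldStats("Employee", "internal_code", writes=150, null_writes=0, queries=0),
    FieldStats("Employee", "name", writes=200, null_writes=0, queries=180),
    FieldStats("Employee", "badge", writes=4, null_writes=4, queries=0),
]
suggestions = suggest(stats)
for kind, node_type, field in suggestions:
    print(f"[{kind}] {node_type}.{field}")
```

The `badge` field is null in every write and never queried, yet it produces no suggestion because four writes fall below `min_samples`. That guard is what keeps early, sparse usage from generating noisy advice.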

Generate migration code¶

from trails.onto_refine import generate_migration_code

code = generate_migration_code(report.suggestions)
print(code)

Output is a reviewable Python snippet -- no changes are applied without explicit human approval.

Step 6: The admin dashboard for continuous evolution¶

The trails-admin UI at /admin/ontology ties all three mechanisms together in one browser interface.

Mounting the dashboard¶

from fastapi import FastAPI
from trails_admin.routes.ontology import router

app = FastAPI()
app.include_router(router)
# Dashboard available at /admin/ontology

What you see¶

Current types -- every registered @node_type with IRI, fields, and inheritance.
Schema inference -- last run results with a "Run inference" button to trigger a new analysis.
Refinement suggestions -- active suggestions from the UsageCollector with confidence bars.
LLM generation -- form for domain-description generation with model selection and cost estimation.

The dashboard is the operator's view of ontology evolution. It exposes the same functionality as the CLI commands, so nothing has to be memorised.

The progressive story: raw data -> types -> shapes -> OWL¶

Trails adds structure progressively. At each step, everything from the previous step keeps working:

Step 1: ctx.kg.node(labels=["Note"], properties={...})
        → Labels and edges. No types. No validation. Works.

Step 2: trails onto infer → @node_type declarations
        → Python types with field validation. Old label-first
          writes still work (different IRI namespace).

Step 3: @shape on the node types → SHACL validation
        → Cardinality, regex, min/max constraints enforced on write.
          Types without shapes still work.

Step 4: Export to OWL → formal ontology for interoperability
        → RDFS/OWL-level axioms for federation and reasoning.
          Everything below still works unchanged.

Each step is additive. You never need to rewrite what you already have.

Why this is a paradigm shift¶

Traditional KG engineering¶

1. Hire an ontology engineer
2. Run workshops with domain experts (2-4 weeks)
3. Model the ontology in Protege (2-4 weeks)
4. Implement validators against the schema
5. Build ingestion pipeline that matches the schema
6. Discover the schema doesn't match real data
7. Go back to step 2

Cost: Months of upfront investment. Schema changes are expensive. The ontology lags behind the data.

The Trails approach¶

1. Load your data (any format)
2. Run `trails onto infer` (seconds)
3. Review the suggestions (minutes)
4. Edit the generated code (minutes)
5. Ship
6. `trails onto refine` suggests improvements over time

Cost: Minutes to a working schema. Schema evolves with the data. The ontology is always current.

Key differences¶

Aspect	Traditional	Trails
Starting point	Whiteboard	Data
Schema creation	Manual, weeks	Automated, minutes
Schema evolution	Committee meetings	`trails onto refine` suggestions
Cost of change	High (remodel + revalidate)	Low (edit Python code)
Ontology format	OWL/Protege files	Plain Python `@node_type`
Runtime dependency	Ontology files loaded at boot	None (code is the schema)
Who does it	Ontology engineers	Any Python developer

Full walkthrough: from CSV to production ontology¶

Here is the complete journey, end to end.

1. Start with a CSV¶

id,name,dept,salary,hire_date,manager_id
1,Alice,Engineering,95000,2023-01-15,
2,Bob,Marketing,82000,2022-06-01,
3,Carol,Engineering,91000,2023-03-20,1
4,Dave,Marketing,78000,2024-01-10,2
5,Eve,Engineering,105000,2021-09-01,

2. Write an RML mapping¶

@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql:  <http://semweb.mmlab.be/ns/ql#>.
@prefix rr:  <http://www.w3.org/ns/r2rml#>.
@prefix ex:  <https://myapp.example/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

<#EmployeeMapping>
    rml:logicalSource [
        rml:source "employees.csv";
        rml:referenceFormulation ql:CSV
    ];
    rr:subjectMap [
        rr:template "https://myapp.example/Employee/{id}";
        rr:class ex:Employee
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rml:reference "name" ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:dept;
        rr:objectMap [ rml:reference "dept" ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:salary;
        rr:objectMap [ rml:reference "salary"; rr:datatype xsd:integer ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:hire_date;
        rr:objectMap [ rml:reference "hire_date"; rr:datatype xsd:date ]
    ].

3. Load the data¶

from trails.rml import run_mapping
from trails.testing import fresh_context

ctx = fresh_context()
result = run_mapping(ctx, "mappings/employees.ttl",
                     employees="employees.csv")
print(f"Loaded {result.triples_added} triples in {result.duration_ms:.0f}ms")
# Loaded 25 triples in 45ms  (5 rows x 4 mapped columns + 5 rdf:type triples; timing will vary)

4. Infer the schema¶

trails onto infer --min-instances 3 -o models/employee.py

Output:

"""Auto-generated @node_type declarations.

Source: trails onto infer (deterministic)
"""
from __future__ import annotations

import datetime

from trails.orm import node_type

# 5 instance(s), confidence: 0.92
# source rdf:type: https://myapp.example/Employee
@node_type("Employee", fields={
    "name": str,       # confidence: 1.00 | e.g. 'Alice', 'Bob', 'Carol'
    "dept": str,       # confidence: 1.00 | e.g. 'Engineering', 'Marketing'
    "salary": int,     # confidence: 1.00 | e.g. '95000', '82000'
    "hire_date": str,  # confidence: 1.00 | e.g. '2023-01-15'
})
class Employee:
    pass

5. Review and improve¶

import datetime
from trails import node_type

@node_type("Employee", fields={
    "name": str,
    "department": str,       # renamed from "dept" for clarity
    "salary": int,
    "hire_date": datetime.date,      # upgraded from str; values are xsd:date
    "manager": "Employee | None",    # added from domain knowledge
})
class Employee: ...

6. Use in capabilities¶

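from trails import capability

# Assumes the reviewed type from step 5 lives in models/employee.py
# and that models/ is importable as a package.
from models.employee import Employee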

@capability
def top_earners(ctx, min_salary: int = 90000) -> list:
    hits = Employee.where(salary__gte=min_salary).fetch(ctx)
    return [
        {"name": e.name, "department": e.department, "salary": e.salary}
        for e in hits
    ]

@capability
def team_report(ctx, department: str) -> dict:
    members = Employee.where(department=department).fetch(ctx)
    total = sum(m.salary for m in members)
    return {
        "department": department,
        "headcount": len(members),
        "total_salary": total,
        "members": [{"name": m.name, "salary": m.salary} for m in members],
    }
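The keyword arguments passed to `where` follow the familiar `field__operator` convention. The sketch below shows how such lookups can be read, using an in-memory list instead of the graph. It is not how Trails evaluates queries, and `Row`, `LOOKUPS`, and this `where` function are hypothetical.

```python
import operator
from dataclasses import dataclass

# Suffix -> comparison; a bare field name means equality.
LOOKUPS = {
    "gte": operator.ge,
    "lte": operator.le,
    "gt": operator.gt,
    "lt": operator.lt,
}


@dataclass
class Row:
    name: str
    department: str
    salary: int


def where(rows, **conditions):
    """Filter rows with keyword lookups such as salary__gte=90000."""
    predicates = []
    for key, expected in conditions.items():
        field, _, suffix = key.partition("__")
        compare = LOOKUPS[suffix] if suffix else operator.eq
        predicates.append((field, compare, expected))
    return [
        row for row in rows
        if all(compare(getattr(row, field), expected)
               for field, compare, expected in predicates)
    ]


rows = [
    Row("Alice", "Engineering", 95000),
    Row("Bob", "Marketing", 82000),
    Row("Carol", "Engineering", 91000),
]
top = where(rows, salary__gte=90000)
engineering = where(rows, department="Engineering", salary__lt=95000)
print([r.name for r in top], [r.name for r in engineering])
# ['Alice', 'Carol'] ['Carol']
```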

7. Let it refine over time¶

from trails.onto_refine import UsageCollector, SchemaAnalyzer

collector = UsageCollector()
collector.start()

# App runs for a week...

analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(min_samples=50)
for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason}")

# Example output:
# [add_index] Employee.department: Fields [name, department]
#   co-queried in 87% of cases — consider a composite index.
# [make_optional] Employee.manager: Field 'manager' is null in
#   92% of 500 writes — consider making it optional.

8. Monitor in the dashboard¶

pip install 'trails[admin]'
trails-admin --app myapp.main
# Open http://127.0.0.1:4455/admin/ontology
