Chapter 8 — The Auto-Ontology Paradigm

Traditional knowledge-graph engineering starts with an ontology workshop: domain experts, whiteboard sessions, weeks of modelling before a single triple reaches the store. Trails does the opposite. You load data first, and the framework discovers your schema.

This chapter walks through the full auto-ontology workflow -- from a raw CSV to a production-grade ontology -- and explains why this is a genuine paradigm shift.


Why ontology-first fails

The classic KG workflow is:

design ontology → build store → validate data → ship

This ordering creates a chicken-and-egg problem and two more that follow from it:

  1. You cannot design a good ontology without data. Domain experts know their field but not how their data actually looks in practice. A field that seems mandatory may be empty in, say, 40% of records. Relationships that seem obvious may not exist in the data at all.

  2. The ontology becomes a bottleneck. Every new data source requires an ontology committee meeting. A field that doesn't fit the schema gets dropped or force-fit. The ontology calcifies.

  3. The investment is front-loaded. You spend weeks modelling before writing a single line of application code. If the domain shifts (it will), the ontology lags.

The Trails approach: data-first, schema-later

load data → infer types → refine schema → ship

Trails provides four mechanisms that support this workflow, each adopted progressively:

| Phase | Mechanism | Cost | What it does |
|-------|-----------|------|--------------|
| 1 | trails onto infer | Free (SPARQL) | Statistical schema inference from existing data |
| 2 | trails onto generate | LLM tokens (~$0.001) | Describe your domain, get @node_type code |
| 3 | trails onto refine | Free (log analysis) | Usage-driven refinement suggestions |
| 4 | Admin dashboard | Free | Browser UI for continuous evolution |

No phase depends on the previous one. Use any combination. The output is always plain Python @node_type code -- no magic, no runtime dependency on the generator.


Step 1: Load raw data (labels and edges, no types)

Start with data. You don't need @node_type declarations to write to the graph. Use ctx.kg.node() for label-first writes, or use RML mappings for structured sources.

Option A: Label-first writes

from trails import capability

@capability
def import_employees(ctx) -> dict:
    """Load employee data with no schema at all."""
    employees = [
        {"name": "Alice", "department": "Engineering", "salary": 95000},
        {"name": "Bob", "department": "Marketing", "salary": 82000},
        {"name": "Carol", "department": "Engineering", "salary": 91000,
         "manager": "Alice"},
    ]
    count = 0
    for emp in employees:
        ctx.kg.node(labels=["Employee"], properties=emp)
        count += 1
    return {"imported": count}

No type declarations. No SHACL shapes. Just labels and properties.

Option B: RML mapping from CSV

Given employees.csv:

id,name,department,salary,manager_id
1,Alice,Engineering,95000,
2,Bob,Marketing,82000,
3,Carol,Engineering,91000,1

And a mapping file mappings/employees.ttl:

@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql:  <http://semweb.mmlab.be/ns/ql#>.
@prefix rr:  <http://www.w3.org/ns/r2rml#>.
@prefix ex:  <https://myapp.example/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

<#EmployeeMapping>
    rml:logicalSource [
        rml:source "employees.csv";
        rml:referenceFormulation ql:CSV
    ];
    rr:subjectMap [
        rr:template "https://myapp.example/Employee/{id}";
        rr:class ex:Employee
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rml:reference "name" ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:department;
        rr:objectMap [ rml:reference "department" ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:salary;
        rr:objectMap [ rml:reference "salary"; rr:datatype xsd:integer ]
    ].

Load it:

from trails.rml import run_mapping
from trails.testing import fresh_context

ctx = fresh_context()
result = run_mapping(ctx, "mappings/employees.ttl",
                     employees="employees.csv")
print(f"Loaded {result.triples_added} triples")

Either way, data is in the graph. No ontology needed yet.
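
To make Option B concrete, the sketch below mimics what a mapping like the one above does with the three-row employees.csv: one subject IRI per row from the rr:template, one rdf:type triple from rr:class, and one triple per mapped column. It uses only the Python standard library and is not the Trails or RML implementation; expand_rows and MAPPED_COLUMNS are illustrative names, and skipping empty cells is an assumption of this sketch.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://myapp.example/"

CSV_TEXT = """id,name,department,salary,manager_id
1,Alice,Engineering,95000,
2,Bob,Marketing,82000,
3,Carol,Engineering,91000,1
"""

# Columns covered by the rr:predicateObjectMap entries (manager_id is not mapped).
MAPPED_COLUMNS = ["name", "department", "salary"]


def expand_rows(csv_text):
    """Yield (subject, predicate, object) triples, row by row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"{EX}Employee/{row['id']}"        # rr:template
        yield (subject, RDF_TYPE, f"{EX}Employee")   # rr:class
        for column in MAPPED_COLUMNS:
            value = row[column]
            if value == "":
                continue  # assumption: an empty cell produces no triple
            if column == "salary":
                value = int(value)                   # rr:datatype xsd:integer
            yield (subject, f"{EX}{column}", value)


triples = list(expand_rows(CSV_TEXT))
print(len(triples))  # 3 rows x (rdf:type + 3 mapped columns) = 12
```

Because manager_id has no predicateObjectMap, it contributes no triples; add one to the mapping if the manager relationship should reach the graph.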


Step 2: trails onto infer -- the framework discovers your schema

Now that data exists, ask the framework to analyse it:

trails onto infer -o models/inferred.py

The inference engine runs four deterministic passes:

  1. Type discovery. Finds all distinct rdf:type values and their instance counts.
  2. Predicate clustering. For each candidate type, collects every predicate used by its instances. When no explicit rdf:type exists, entities are clustered by Jaccard similarity (>= 80% shared predicates).
  3. Value-type inference. Examines values: XSD annotations are respected first; heuristics apply for un-annotated values (all-integer -> int, all-IRI -> reference, etc.).
  4. Cardinality inference. Counts values per subject per predicate to decide whether a field is required and whether it holds a single value or a list[T].
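
The clustering in pass 2 can be illustrated with a small standalone sketch. It groups untyped entities whose predicate sets have a Jaccard similarity of at least 0.8. The greedy strategy (compare each entity against the first member of every existing cluster) and the function names are assumptions made for illustration; the text above specifies only the similarity measure and the threshold.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union; 1.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_predicates(entities, threshold=0.8):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_predicates, [entity ids])
    for entity_id, predicates in entities.items():
        for seed, members in clusters:
            if jaccard(predicates, seed) >= threshold:
                members.append(entity_id)
                break
        else:
            # No existing cluster matched, so this entity seeds a new one.
            clusters.append((predicates, [entity_id]))
    return [members for _, members in clusters]


entities = {
    "e1": {"name", "department", "salary", "hire_date", "manager"},
    "e2": {"name", "department", "salary", "hire_date"},   # 4 of 5 shared -> 0.8
    "e3": {"title", "body", "created"},
    "e4": {"name", "department", "salary", "hire_date", "manager"},
}
clusters = cluster_by_predicates(entities)
print(clusters)  # [['e1', 'e2', 'e4'], ['e3']]
```

Each resulting cluster would then be treated as one candidate type, with the union of its predicates as candidate fields.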

Zero LLM cost. Pure SPARQL statistical analysis.
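
Passes 3 and 4 are simple enough to sketch in plain Python. The example below shows one plausible shape for the heuristics named above (all-integer -> int, all-IRI -> reference) and for counting values per subject. The function names, the fallback to str, and the returned dictionary are assumptions of this sketch, not the Trails internals.

```python
def infer_python_type(values):
    """Pick a type name for un-annotated string values."""
    if not values:
        return "str"  # nothing to learn from; fall back to the widest type

    def is_int(value):
        try:
            int(value)
            return True
        except ValueError:
            return False

    if all(is_int(v) for v in values):
        return "int"
    if all(v.startswith(("http://", "https://")) for v in values):
        return "reference"
    return "str"


def infer_cardinality(values_per_subject, instance_count):
    """values_per_subject maps each subject to its value count for one predicate."""
    present = sum(1 for n in values_per_subject.values() if n > 0)
    return {
        "required": present == instance_count,                       # every instance has it
        "is_list": any(n > 1 for n in values_per_subject.values()),  # some instance has several
    }


print(infer_python_type(["95000", "82000"]))                     # int
print(infer_python_type(["https://myapp.example/Employee/1"]))   # reference
print(infer_python_type(["Engineering", "42"]))                  # str
print(infer_cardinality({"e1": 1, "e2": 1, "e3": 0}, 3))
print(infer_cardinality({"p1": 2, "p2": 1}, 2))
```

The fourth call reports a field that is optional and single-valued; the fifth reports one that is required and list-valued.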

Python API

from trails.onto_infer import infer_schema, generate_code

schema = infer_schema(
    ctx.kg._store,
    trace_id="employee-analysis",
    min_instances=3,
    confidence=0.8,
)

print(f"Analyzed {schema.total_triples_analyzed} triples")
print(f"Found {len(schema.candidates)} candidate types")

for candidate in schema.candidates:
    print(f"\n{candidate.name} ({candidate.instance_count} instances, "
          f"confidence: {candidate.confidence:.2f})")
    for prop in candidate.properties:
        marker = "required" if prop.required else "optional"
        print(f"  {prop.name}: {prop.python_type} ({marker}, "
              f"confidence: {prop.confidence:.2f})")

# Generate the Python module
code = generate_code(schema)
print(code)

Reading the output

The generated file looks like hand-written code, with confidence annotations:

"""Auto-generated @node_type declarations.

Source: trails onto infer (deterministic)
"""
from __future__ import annotations

from trails.orm import node_type

# 47 instance(s), confidence: 0.94
# source rdf:type: https://myapp.example/Employee
@node_type("Employee", fields={
    "name": str,         # confidence: 1.00 | e.g. 'Alice', 'Bob', 'Carol'
    "department": str,   # confidence: 1.00 | e.g. 'Engineering', 'Marketing'
    "salary": int,       # confidence: 0.96 | e.g. '95000', '82000'
    "manager": str,      # TODO: verify | confidence: 0.72 | e.g. 'https://...'
})
class Employee:
    pass

The markers tell you what to trust:

| Marker | Meaning |
|--------|---------|
| confidence: 1.00 | Every instance has this field -- high trust |
| confidence: 0.72 | Only 72% of instances have this field -- review it |
| TODO: verify | Confidence below 0.8 -- the field may be noise |
| Instance count | More instances = more reliable inference |
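
Read this way, a field's confidence is its coverage: the share of instances that carry the field. The sketch below computes that share and applies the 0.8 cut-off for the TODO marker. It is a minimal illustration under that reading; annotate_fields is a hypothetical helper, and the real generator may weigh in other signals such as type agreement.

```python
def annotate_fields(instances, threshold=0.8):
    """Return one annotation line per field, flagging low-coverage fields."""
    total = len(instances)
    if total == 0:
        return []
    counts = {}
    for properties in instances:
        for name in properties:
            counts[name] = counts.get(name, 0) + 1
    lines = []
    for name, count in counts.items():
        confidence = count / total  # coverage: share of instances with this field
        marker = "TODO: verify | " if confidence < threshold else ""
        lines.append(f"{name}: {marker}confidence: {confidence:.2f}")
    return lines


instances = [
    {"name": "Alice", "salary": 95000},
    {"name": "Bob", "salary": 82000},
    {"name": "Carol", "salary": 91000, "manager": "Alice"},
    {"name": "Dave", "salary": 78000, "manager": "Bob"},
]
annotations = annotate_fields(instances)
for line in annotations:
    print(line)
# name: confidence: 1.00
# salary: confidence: 1.00
# manager: TODO: verify | confidence: 0.50
```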

Step 3: Review and refine the suggestions

Open models/inferred.py and apply your domain knowledge:

# Before (generated):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": str,  # TODO: verify | confidence: 0.72
})
class Employee:
    pass

# After (reviewed):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": "Employee | None",  # optional self-reference
})
class Employee: ...

What to look for:

  • Low-confidence fields (TODO: verify): decide if they're real or noise. Remove noise, keep real fields as optional.
  • String-typed references: the manager field pointing to an IRI is probably a reference to another Employee. Change the type.
  • Missing optional markers: fields present in 72% of instances are probably optional -- add | None.

The generated code is now your code. Commit it. No runtime dependency on the generator.


Step 4: trails onto generate -- describe your domain, get types

When you need types for a domain that has no data yet, describe it in plain text:

trails onto generate --from domain.txt -o models/generated.py

Where domain.txt contains:

Clinical trial management. Patients enroll at Sites, which belong to
Studies. Each Study has multiple Arms. Adverse Events link to Patients
and Studies.

Python API

from trails.onto_generate import generate_schema, estimate_cost

# Always estimate cost first
est = estimate_cost(
    "Clinical trial management with patients, studies, sites, "
    "and adverse events.",
    model="haiku",
)
print(f"Estimated cost: ${est.estimated_cost_usd:.6f}")

# Generate
result = generate_schema(
    "Clinical trial management with patients, studies, sites, "
    "and adverse events. Each study has multiple sites. Patients "
    "enroll at a site. Adverse events link to patients and studies.",
    model="haiku",
    max_types=10,
)

print(f"Generated {len(result.types)} types: {result.types}")
print(f"Actual cost: ${result.cost_usd:.6f}")
print(result.code)

Model selection

| Model | Use case | Cost per call |
|-------|----------|---------------|
| haiku (default) | Structural scaffolding, simple domains | ~$0.001 |
| sonnet | Domain-aware naming, nuanced relationships | ~$0.01 |
| opus | Complex multi-domain ontologies | ~$0.10 |

The default is haiku -- the cheapest adequate model.

Iterative refinement

Refine existing code with natural-language feedback:

from trails.onto_generate import refine_schema

refined = refine_schema(
    code=result.code,
    feedback="Add a Randomization type linking Patient to StudyArm. "
             "Make patient.email optional.",
    model="haiku",
)
print(refined.code)

Step 5: trails onto refine -- the framework learns from usage

Once your app runs in production, a UsageCollector records runtime patterns and the SchemaAnalyzer turns them into suggested improvements. Zero LLM cost -- pure log analysis.

Set up the collector

from trails.onto_refine import UsageCollector, SchemaAnalyzer

# Start collecting usage data (register as observability hook)
collector = UsageCollector()
collector.start()

# ... your app runs, writes data, queries data ...

Analyse and get suggestions

analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(
    min_samples=10,
    null_threshold=0.90,
    violation_threshold=0.10,
    co_query_threshold=0.80,
)

for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason} "
          f"(confidence: {s.confidence:.2f}, samples: {s.sample_count})")

Suggestion types

| Kind | Trigger | Example |
|------|---------|---------|
| make_optional | Field null in >90% of writes | "'nickname' null in 95% of 200 writes" |
| remove_field | Written but never queried | "'internal_code' written 150x, never queried" |
| relax_constraint | SHACL violated >10% of writes | "'maxLength' on 'name' violated in 12%" |
| add_index | Fields co-queried >80% | "[name, department] co-queried in 85% of cases" |
| add_relationship | Types always co-occur in queries | "[Patient, Study] co-occur in 50 queries" |
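
The per-field triggers in the table reduce to threshold checks over a few counters. The sketch below shows that logic for make_optional, remove_field, and relax_constraint, reusing the threshold names passed to analyzer.analyze(). FieldStats and suggest are hypothetical names for illustration; how Trails actually stores usage data is not described here.

```python
from dataclasses import dataclass


@dataclass
class FieldStats:
    """Hypothetical per-field counters; not the Trails data model."""
    writes: int = 0
    null_writes: int = 0
    queries: int = 0
    violations: int = 0


def suggest(field, stats, min_samples=10, null_threshold=0.90,
            violation_threshold=0.10):
    """Return (kind, field) pairs for every trigger the counters satisfy."""
    if stats.writes < min_samples:
        return []  # too little evidence to suggest anything
    kinds = []
    if stats.null_writes / stats.writes > null_threshold:
        kinds.append("make_optional")
    if stats.queries == 0:
        kinds.append("remove_field")
    if stats.violations / stats.writes > violation_threshold:
        kinds.append("relax_constraint")
    return [(kind, field) for kind in kinds]


print(suggest("nickname", FieldStats(writes=200, null_writes=190, queries=5)))
# [('make_optional', 'nickname')]
print(suggest("internal_code", FieldStats(writes=150)))
# [('remove_field', 'internal_code')]
print(suggest("name", FieldStats(writes=100, queries=40, violations=12)))
# [('relax_constraint', 'name')]
```

The min_samples guard matters: without it, a field written twice and left null once would already look 50% null.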

Generate migration code

from trails.onto_refine import generate_migration_code

code = generate_migration_code(report.suggestions)
print(code)

Output is a reviewable Python snippet -- no changes are applied without explicit human approval.


Step 6: The admin dashboard for continuous evolution

The trails-admin UI at /admin/ontology ties the three mechanisms above (infer, generate, refine) together in one browser interface.

Mounting the dashboard

from fastapi import FastAPI
from trails_admin.routes.ontology import router

app = FastAPI()
app.include_router(router)
# Dashboard available at /admin/ontology

What you see

  1. Current types -- every registered @node_type with IRI, fields, and inheritance.
  2. Schema inference -- last run results with a "Run inference" button to trigger a new analysis.
  3. Refinement suggestions -- active suggestions derived from UsageCollector data, with confidence bars.
  4. LLM generation -- form for domain-description generation with model selection and cost estimation.

The dashboard is the operator's view of ontology evolution. It replaces the need to remember CLI commands while providing the same functionality.


The progressive story: raw data -> types -> shapes -> OWL

Trails adds structure progressively. At each step, everything from the previous step keeps working:

Step 1: ctx.kg.node(labels=["Note"], properties={...})
        → Labels and edges. No types. No validation. Works.

Step 2: trails onto infer → @node_type declarations
        → Python types with field validation. Old label-first
          writes still work (different IRI namespace).

Step 3: @shape on the node types → SHACL validation
        → Cardinality, regex, min/max constraints enforced on write.
          Types without shapes still work.

Step 4: Export to OWL → formal ontology for interoperability
        → RDFS/OWL-level axioms for federation and reasoning.
          Everything below still works unchanged.

Each step is additive. You never need to rewrite what you already have.


Why this is a paradigm shift

Traditional KG engineering

1. Hire an ontology engineer
2. Run workshops with domain experts (2-4 weeks)
3. Model the ontology in Protege (2-4 weeks)
4. Implement validators against the schema
5. Build ingestion pipeline that matches the schema
6. Discover the schema doesn't match real data
7. Go back to step 2

Cost: Months of upfront investment. Schema changes are expensive. The ontology lags behind the data.

The Trails approach

1. Load your data (any format)
2. Run `trails onto infer` (seconds)
3. Review the suggestions (minutes)
4. Edit the generated code (minutes)
5. Ship
6. `trails onto refine` suggests improvements over time

Cost: Minutes to a working schema. Schema evolves with the data. The ontology is always current.

Key differences

| Aspect | Traditional | Trails |
|--------|-------------|--------|
| Starting point | Whiteboard | Data |
| Schema creation | Manual, weeks | Automated, minutes |
| Schema evolution | Committee meetings | trails onto refine suggestions |
| Cost of change | High (remodel + revalidate) | Low (edit Python code) |
| Ontology format | OWL/Protege files | Plain Python @node_type |
| Runtime dependency | Ontology files loaded at boot | None (code is the schema) |
| Who does it | Ontology engineers | Any Python developer |

Full walkthrough: from CSV to production ontology

Here is the complete journey, end to end.

1. Start with a CSV

id,name,dept,salary,hire_date,manager_id
1,Alice,Engineering,95000,2023-01-15,
2,Bob,Marketing,82000,2022-06-01,
3,Carol,Engineering,91000,2023-03-20,1
4,Dave,Marketing,78000,2024-01-10,2
5,Eve,Engineering,105000,2021-09-01,

2. Write an RML mapping

@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql:  <http://semweb.mmlab.be/ns/ql#>.
@prefix rr:  <http://www.w3.org/ns/r2rml#>.
@prefix ex:  <https://myapp.example/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

<#EmployeeMapping>
    rml:logicalSource [
        rml:source "employees.csv";
        rml:referenceFormulation ql:CSV
    ];
    rr:subjectMap [
        rr:template "https://myapp.example/Employee/{id}";
        rr:class ex:Employee
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rml:reference "name" ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:dept;
        rr:objectMap [ rml:reference "dept" ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:salary;
        rr:objectMap [ rml:reference "salary"; rr:datatype xsd:integer ]
    ];
    rr:predicateObjectMap [
        rr:predicate ex:hire_date;
        rr:objectMap [ rml:reference "hire_date"; rr:datatype xsd:date ]
    ].

3. Load the data

from trails.rml import run_mapping
from trails.testing import fresh_context

ctx = fresh_context()
result = run_mapping(ctx, "mappings/employees.ttl",
                     employees="employees.csv")
print(f"Loaded {result.triples_added} triples in {result.duration_ms:.0f}ms")
# Loaded 25 triples in 45ms  (5 rows x 5 triples each, including rdf:type; timing will vary)

4. Infer the schema

trails onto infer --min-instances 3 -o models/employee.py

Output:

"""Auto-generated @node_type declarations.

Source: trails onto infer (deterministic)
"""
from __future__ import annotations

import datetime

from trails.orm import node_type

# 5 instance(s), confidence: 0.92
# source rdf:type: https://myapp.example/Employee
@node_type("Employee", fields={
    "name": str,       # confidence: 1.00 | e.g. 'Alice', 'Bob', 'Carol'
    "dept": str,       # confidence: 1.00 | e.g. 'Engineering', 'Marketing'
    "salary": int,     # confidence: 1.00 | e.g. '95000', '82000'
    "hire_date": str,  # confidence: 1.00 | e.g. '2023-01-15'
})
class Employee:
    pass

5. Review and improve

import datetime
from trails.orm import node_type

@node_type("Employee", fields={
    "name": str,
    "department": str,       # renamed from "dept" for clarity
    "salary": int,
    "hire_date": datetime.date,      # upgraded from str; the source values are xsd:date
    "manager": "Employee | None",    # added from domain knowledge
})
class Employee: ...

6. Use in capabilities

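from trails import capability
from models.employee import Employee  # reviewed types from step 5 (import path assumed from -o models/employee.py)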

@capability
def top_earners(ctx, min_salary: int = 90000) -> list:
    hits = Employee.where(salary__gte=min_salary).fetch(ctx)
    return [
        {"name": e.name, "department": e.department, "salary": e.salary}
        for e in hits
    ]

@capability
def team_report(ctx, department: str) -> dict:
    members = Employee.where(department=department).fetch(ctx)
    total = sum(m.salary for m in members)
    return {
        "department": department,
        "headcount": len(members),
        "total_salary": total,
        "members": [{"name": m.name, "salary": m.salary} for m in members],
    }
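
The keyword arguments passed to where() above follow the familiar field__operator convention. As a rough mental model only, the sketch below evaluates such lookups against plain dictionaries. It assumes that a bare field name means equality and that __gte means "greater than or equal"; the operator table and the matches helper are illustrative and say nothing about how Trails compiles these filters into queries.

```python
import operator

# Assumed operator suffixes; only "gte" and plain equality appear in the text above.
OPERATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(record, **lookups):
    """True if the record satisfies every field or field__operator lookup."""
    for key, expected in lookups.items():
        field, _, op_name = key.partition("__")
        compare = OPERATORS[op_name or "eq"]
        if field not in record or not compare(record[field], expected):
            return False
    return True


employees = [
    {"name": "Alice", "department": "Engineering", "salary": 95000},
    {"name": "Bob", "department": "Marketing", "salary": 82000},
    {"name": "Carol", "department": "Engineering", "salary": 91000},
]

top = [e["name"] for e in employees if matches(e, salary__gte=90000)]
marketing = [e["name"] for e in employees
             if matches(e, department="Marketing", salary__gte=80000)]
print(top)        # ['Alice', 'Carol']
print(marketing)  # ['Bob']
```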

7. Let it refine over time

from trails.onto_refine import UsageCollector, SchemaAnalyzer

collector = UsageCollector()
collector.start()

# App runs for a week...

analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(min_samples=50)
for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason}")

# Example output:
# [add_index] Employee.department: Fields [name, department]
#   co-queried in 87% of cases — consider a composite index.
# [make_optional] Employee.manager: Field 'manager' is null in
#   92% of 500 writes — consider making it optional.
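
The add_index suggestion rests on a co-query rate. One way to define it, assumed here for illustration, is: of all queries that filter on at least one of two fields, the share that filter on both. The sketch below computes that rate for every field pair from a list of per-query field sets. The definition and the co_query_rates helper are assumptions; the analyzer's exact formula is not given in this chapter.

```python
from itertools import combinations


def co_query_rates(queries):
    """Map each field pair to (queries using both) / (queries using either)."""
    fields = sorted(set().union(*queries)) if queries else []
    rates = {}
    for a, b in combinations(fields, 2):
        either = sum(1 for q in queries if a in q or b in q)
        both = sum(1 for q in queries if a in q and b in q)
        rates[(a, b)] = both / either  # either >= 1 because both fields occur somewhere
    return rates


# Each set lists the fields one query filtered on.
queries = [{"name", "department"}] * 6 + [{"name"}, {"salary"}]

rates = co_query_rates(queries)
candidates = [pair for pair, rate in rates.items() if rate >= 0.80]
print(candidates)  # [('department', 'name')]
```

Here name and department appear together in 6 of the 7 queries that touch either field (about 86%), which clears a 0.80 threshold like the co_query_threshold shown earlier.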

8. Monitor in the dashboard

pip install 'trails[admin]'
trails-admin --app myapp.main
# Open http://127.0.0.1:4455/admin/ontology

See also