Chapter 8 -- The Auto-Ontology Paradigm
Traditional knowledge-graph engineering starts with an ontology workshop: domain experts, whiteboard sessions, weeks of modelling before a single triple reaches the store. Trails does the opposite. You load data first, and the framework discovers your schema.
This chapter walks through the full auto-ontology workflow -- from a raw CSV to a production-grade ontology -- and explains why this is a genuine paradigm shift.
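Why ontology-first fails
The classic KG workflow is to model the ontology up front, build validators and an ingestion pipeline against that schema, and only then load real data.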
This creates a chicken-and-egg problem:
- You cannot design a good ontology without data. Domain experts know their field but not how their data actually looks in practice. Fields that seem mandatory turn out to be missing in, say, 40% of records. Relationships that seem obvious don't exist in the data.
- The ontology becomes a bottleneck. Every new data source requires an ontology committee meeting. A field that doesn't fit the schema gets dropped or force-fit. The ontology calcifies.
- The investment is front-loaded. You spend weeks modelling before writing a single line of application code. If the domain shifts (it will), the ontology lags.
The Trails approach: data-first, schema-later
Trails provides four mechanisms that support this workflow, each adopted progressively:
| Phase | Mechanism | Cost | What it does |
|---|---|---|---|
| 1 | trails onto infer | Free (SPARQL) | Statistical schema inference from existing data |
| 2 | trails onto generate | LLM tokens (~$0.001) | Describe your domain, get @node_type code |
| 3 | trails onto refine | Free (log analysis) | Usage-driven refinement suggestions |
| 4 | Admin dashboard | Free | Browser UI for continuous evolution |
No phase depends on the previous one. Use any combination. The output
is always plain Python @node_type code -- no magic, no runtime
dependency on the generator.
Step 1: Load raw data (labels and edges, no types)
Start with data. You don't need @node_type declarations to write to
the graph. Use ctx.kg.node() for label-first writes, or use RML
mappings for structured sources.
Option A: Label-first writes
from trails import capability
@capability
def import_employees(ctx) -> dict:
    """Load employee data with no schema at all."""
    employees = [
        {"name": "Alice", "department": "Engineering", "salary": 95000},
        {"name": "Bob", "department": "Marketing", "salary": 82000},
        {"name": "Carol", "department": "Engineering", "salary": 91000,
         "manager": "Alice"},
    ]
    count = 0
    for emp in employees:
        ctx.kg.node(labels=["Employee"], properties=emp)
        count += 1
    return {"imported": count}
No type declarations. No SHACL shapes. Just labels and properties.
Option B: RML mapping from CSV
Given employees.csv:
id,name,department,salary,manager_id
1,Alice,Engineering,95000,
2,Bob,Marketing,82000,
3,Carol,Engineering,91000,1
And a mapping file mappings/employees.ttl:
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix ex: <https://myapp.example/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
<#EmployeeMapping>
rml:logicalSource [
rml:source "employees.csv";
rml:referenceFormulation ql:CSV
];
rr:subjectMap [
rr:template "https://myapp.example/Employee/{id}";
rr:class ex:Employee
];
rr:predicateObjectMap [
rr:predicate ex:name;
rr:objectMap [ rml:reference "name" ]
];
rr:predicateObjectMap [
rr:predicate ex:department;
rr:objectMap [ rml:reference "department" ]
];
rr:predicateObjectMap [
rr:predicate ex:salary;
rr:objectMap [ rml:reference "salary"; rr:datatype xsd:integer ]
].
Load it:
from trails.rml import run_mapping
from trails.testing import fresh_context
ctx = fresh_context()
result = run_mapping(ctx, "mappings/employees.ttl",
employees="employees.csv")
print(f"Loaded {result.triples_added} triples")
Either way, data is in the graph. No ontology needed yet.
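To make the result of Option B concrete, the following sketch mimics what the mapping above produces for the sample CSV, using only the standard library. It is an illustration of the mapping's semantics (one subject per row, one rdf:type triple from rr:class, one triple per mapped column), not the Trails or RML implementation; the function name rows_to_triples is hypothetical.

```python
import csv
import io

CSV_TEXT = """id,name,department,salary,manager_id
1,Alice,Engineering,95000,
2,Bob,Marketing,82000,
3,Carol,Engineering,91000,1
"""

EX = "https://myapp.example/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def rows_to_triples(csv_text: str) -> list[tuple[str, str, object]]:
    """Hypothetical helper: emit the triples the mapping describes for each row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"{EX}Employee/{row['id']}"                    # rr:template
        triples.append((subject, RDF_TYPE, f"{EX}Employee"))     # rr:class
        triples.append((subject, f"{EX}name", row["name"]))
        triples.append((subject, f"{EX}department", row["department"]))
        triples.append((subject, f"{EX}salary", int(row["salary"])))  # xsd:integer
    return triples


triples = rows_to_triples(CSV_TEXT)
print(len(triples))  # 3 rows x 4 triples each
```

Note that manager_id produces no triples here because the mapping file does not declare a predicateObjectMap for it.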
Step 2: trails onto infer -- the framework discovers your schema
Now that data exists, run trails onto infer to have the framework analyse it.
The inference engine runs four deterministic passes:
- Type discovery. Finds all distinct rdf:type values and their instance counts.
- Predicate clustering. For each candidate type, collects every predicate used by its instances. When no explicit rdf:type exists, entities are clustered by Jaccard similarity (>= 80% shared predicates).
- Type inference. Examines values: XSD annotations are respected first; heuristics apply for un-annotated values (all-integer -> int, all-IRI -> reference, etc.).
- Cardinality inference. Counts values per subject per predicate to determine required and list[T].
Zero LLM cost. Pure SPARQL statistical analysis.
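The predicate-clustering pass can be pictured with a small pure-Python sketch. It assumes a greedy strategy (each entity joins the first cluster whose representative predicate set has Jaccard similarity >= 0.8); the actual clustering strategy inside Trails is not documented here, and the names jaccard and cluster_by_predicates are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_predicates(
    entities: dict[str, set[str]], threshold: float = 0.8
) -> list[list[str]]:
    """Greedy clustering (an assumption): compare each entity with the first
    member of every existing cluster and join the first sufficiently similar one."""
    clusters: list[tuple[set[str], list[str]]] = []
    for entity, predicates in entities.items():
        for representative, members in clusters:
            if jaccard(predicates, representative) >= threshold:
                members.append(entity)
                break
        else:
            # No similar cluster found: this entity starts a new candidate type.
            clusters.append((set(predicates), [entity]))
    return [members for _, members in clusters]


# Untyped entities described only by the predicates they use.
untyped = {
    "e1": {"name", "department", "salary", "hire_date", "manager"},
    "e2": {"name", "department", "salary", "hire_date"},
    "e3": {"name", "department", "salary", "hire_date", "manager"},
    "p1": {"title", "budget"},
    "p2": {"title", "budget"},
}
clusters = cluster_by_predicates(untyped)
print(clusters)
```

Here e2 shares four of five predicates with e1 (similarity 0.8), so it lands in the same candidate type even though it has no manager.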
Python API
from trails.onto_infer import infer_schema, generate_code
schema = infer_schema(
    ctx.kg._store,
    trace_id="employee-analysis",
    min_instances=3,
    confidence=0.8,
)
print(f"Analyzed {schema.total_triples_analyzed} triples")
print(f"Found {len(schema.candidates)} candidate types")
for candidate in schema.candidates:
    print(f"\n{candidate.name} ({candidate.instance_count} instances, "
          f"confidence: {candidate.confidence:.2f})")
    for prop in candidate.properties:
        marker = "required" if prop.required else "optional"
        print(f" {prop.name}: {prop.python_type} ({marker}, "
              f"confidence: {prop.confidence:.2f})")
# Generate the Python module
code = generate_code(schema)
print(code)
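The type-inference and cardinality-inference passes described earlier can be sketched in plain Python over a list of (subject, predicate, object) tuples. This is a simplified illustration of the stated heuristics (all-integer -> int, all-IRI -> reference, otherwise str; required when every subject has the predicate; list when any subject has more than one value), not the SPARQL-based implementation. The names infer_python_type and infer_fields are hypothetical.

```python
from collections import defaultdict


def infer_python_type(values: list[object]) -> str:
    """Heuristic for un-annotated values, following the rules stated in the text."""
    def is_int(v: object) -> bool:
        return isinstance(v, int) or (isinstance(v, str) and v.lstrip("-").isdigit())

    def is_iri(v: object) -> bool:
        return isinstance(v, str) and v.startswith(("http://", "https://"))

    if all(is_int(v) for v in values):
        return "int"
    if all(is_iri(v) for v in values):
        return "reference"
    return "str"


def infer_fields(triples: list[tuple[str, str, object]], subjects: list[str]) -> dict:
    """Count values per subject per predicate to derive required / list-valued flags."""
    values_by_predicate = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        values_by_predicate[predicate][subject].append(obj)

    fields = {}
    for predicate, by_subject in values_by_predicate.items():
        all_values = [v for vs in by_subject.values() for v in vs]
        fields[predicate] = {
            "type": infer_python_type(all_values),
            "required": all(s in by_subject for s in subjects),
            "is_list": any(len(vs) > 1 for vs in by_subject.values()),
        }
    return fields


triples = [
    ("e1", "name", "Alice"), ("e1", "salary", "95000"),
    ("e2", "name", "Bob"), ("e2", "salary", "82000"),
    ("e3", "name", "Carol"), ("e3", "salary", "91000"),
    ("e3", "manager", "https://myapp.example/Employee/1"),
    ("e1", "skill", "python"), ("e1", "skill", "sparql"),
]
fields = infer_fields(triples, subjects=["e1", "e2", "e3"])
for name, info in fields.items():
    print(name, info)
```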
Reading the output
The generated file looks like hand-written code, with confidence annotations:
"""Auto-generated @node_type declarations.
Source: trails onto infer (deterministic)
"""
from __future__ import annotations
from trails.orm import node_type
# 47 instance(s), confidence: 0.94
# source rdf:type: https://myapp.example/Employee
@node_type("Employee", fields={
    "name": str,        # confidence: 1.00 | e.g. 'Alice', 'Bob', 'Carol'
    "department": str,  # confidence: 1.00 | e.g. 'Engineering', 'Marketing'
    "salary": int,      # confidence: 0.96 | e.g. '95000', '82000'
    "manager": str,     # TODO: verify | confidence: 0.72 | e.g. 'https://...'
})
class Employee:
    pass
The markers tell you what to trust:
| Marker | Meaning |
|---|---|
| confidence: 1.00 | Every instance has this field -- high trust |
| confidence: 0.72 | Only 72% of instances have this field -- review it |
| TODO: verify | Confidence below 0.8 -- the field may be noise |
| Instance count | More instances = more reliable inference |
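As a rough illustration of how such markers could be derived, the sketch below assumes that a field's confidence is simply the share of instances carrying a non-null value, which is how the table reads. The real engine may also weigh type consistency; the names field_confidence and annotate are hypothetical.

```python
def field_confidence(instances: list[dict], field: str) -> float:
    """Share of instances that carry a non-null value for the field."""
    present = sum(1 for inst in instances if inst.get(field) is not None)
    return present / len(instances)


def annotate(instances: list[dict], threshold: float = 0.8) -> dict[str, str]:
    """Build the trailing comment for each field, flagging low confidence."""
    fields = sorted({key for inst in instances for key in inst})
    comments = {}
    for field in fields:
        conf = field_confidence(instances, field)
        flag = "" if conf >= threshold else "TODO: verify | "
        comments[field] = f"# {flag}confidence: {conf:.2f}"
    return comments


# 25 made-up employees, 18 of which have a manager (18 / 25 = 0.72).
instances = [
    {"name": f"emp{i}",
     "manager": "https://myapp.example/Employee/0" if i < 18 else None}
    for i in range(25)
]
comments = annotate(instances)
print(comments["name"])     # present everywhere, no flag
print(comments["manager"])  # below the 0.8 threshold, flagged for review
```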
Step 3: Review and refine the suggestions
Open models/inferred.py and apply your domain knowledge:
# Before (generated):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": str,  # TODO: verify | confidence: 0.72
})
class Employee:
    pass
# After (reviewed):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": "Employee | None",  # optional self-reference
})
class Employee: ...
What to look for:
- Low-confidence fields (TODO: verify): decide if they're real or noise. Remove noise, keep real fields as optional.
- String-typed references: the manager field pointing to an IRI is probably a reference to another Employee. Change the type.
- Missing optional markers: fields present in 72% of instances are probably optional -- add | None.
The generated code is now your code. Commit it. No runtime dependency on the generator.
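The string-typed-reference and missing-optional checks from the review list can be partly mechanised before you edit by hand. The sketch below is a hypothetical review helper, not part of Trails: it proposes a reference annotation when every non-null value is the IRI of a known entity, and appends | None when some instances lack the field.

```python
def suggest_annotation(
    values: list[object], known_subjects: set[str], type_name: str
) -> str:
    """Hypothetical helper: propose a field annotation from observed values."""
    present = [v for v in values if v is not None]
    # A reference candidate: every observed value is the IRI of a known entity.
    is_reference = bool(present) and all(v in known_subjects for v in present)
    base = type_name if is_reference else "str"
    # Optional when at least one instance has no value for the field.
    return base if len(present) == len(values) else f"{base} | None"


employees = {
    "https://myapp.example/Employee/1",
    "https://myapp.example/Employee/2",
    "https://myapp.example/Employee/3",
}
manager_values = [None, None, "https://myapp.example/Employee/1"]
name_values = ["Alice", "Bob", "Carol"]

print(suggest_annotation(manager_values, employees, "Employee"))
print(suggest_annotation(name_values, employees, "Employee"))
```

The output is only a suggestion; whether manager really is a self-reference remains a domain decision.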
Step 4: trails onto generate -- describe your domain, get types
When you need types for a domain that has no data yet, describe it in plain text and pass that description to trails onto generate.
For example, a domain.txt file might contain:
Clinical trial management. Patients enroll at Sites, which belong to
Studies. Each Study has multiple Arms. Adverse Events link to Patients
and Studies.
Python API
from trails.onto_generate import generate_schema, estimate_cost
# Always estimate cost first
est = estimate_cost(
"Clinical trial management with patients, studies, sites, "
"and adverse events.",
model="haiku",
)
print(f"Estimated cost: ${est.estimated_cost_usd:.6f}")
# Generate
result = generate_schema(
"Clinical trial management with patients, studies, sites, "
"and adverse events. Each study has multiple sites. Patients "
"enroll at a site. Adverse events link to patients and studies.",
model="haiku",
max_types=10,
)
print(f"Generated {len(result.types)} types: {result.types}")
print(f"Actual cost: ${result.cost_usd:.6f}")
print(result.code)
Model selection
| Model | Use case | Cost per call |
|---|---|---|
| haiku (default) | Structural scaffolding, simple domains | ~$0.001 |
| sonnet | Domain-aware naming, nuanced relationships | ~$0.01 |
| opus | Complex multi-domain ontologies | ~$0.10 |
The default is haiku -- cheapest adequate model.
Iterative refinement
Refine existing code with natural-language feedback:
from trails.onto_generate import refine_schema
refined = refine_schema(
code=result.code,
feedback="Add a Randomization type linking Patient to StudyArm. "
"Make patient.email optional.",
model="haiku",
)
print(refined.code)
Step 5: trails onto refine -- the framework learns from usage
Once your app runs in production, the UsageCollector records runtime
patterns and the SchemaAnalyzer turns them into suggested improvements. Zero LLM cost -- pure log analysis.
Set up the collector
from trails.onto_refine import UsageCollector, SchemaAnalyzer
# Start collecting usage data (register as observability hook)
collector = UsageCollector()
collector.start()
# ... your app runs, writes data, queries data ...
Analyse and get suggestions
analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(
    min_samples=10,
    null_threshold=0.90,
    violation_threshold=0.10,
    co_query_threshold=0.80,
)
for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason} "
          f"(confidence: {s.confidence:.2f}, samples: {s.sample_count})")
Suggestion types
| Kind | Trigger | Example |
|---|---|---|
| make_optional | Field null in >90% of writes | "'nickname' null in 95% of 200 writes" |
| remove_field | Written but never queried | "'internal_code' written 150x, never queried" |
| relax_constraint | SHACL violated >10% of writes | "'maxLength' on 'name' violated in 12%" |
| add_index | Fields co-queried >80% | "[name, department] co-queried in 85% of cases" |
| add_relationship | Types always co-occur in queries | "[Patient, Study] co-occur in 50 queries" |
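Two of these triggers are simple threshold rules over usage counters, which the following sketch illustrates. It assumes per-field counters for writes, null writes, and queries; the FieldUsage structure and the suggest function are hypothetical and only mirror the make_optional and remove_field rows of the table.

```python
from dataclasses import dataclass


@dataclass
class FieldUsage:
    """Hypothetical per-field counters gathered from runtime logs."""
    writes: int
    null_writes: int
    queries: int


def suggest(
    usage: dict[str, FieldUsage], min_samples: int = 10, null_threshold: float = 0.90
) -> list[tuple[str, str, str]]:
    """Apply the make_optional and remove_field triggers to each field."""
    suggestions = []
    for field, u in usage.items():
        if u.writes < min_samples:
            continue  # too few observations to say anything
        null_rate = u.null_writes / u.writes
        if null_rate > null_threshold:
            reason = f"'{field}' null in {null_rate:.0%} of {u.writes} writes"
            suggestions.append(("make_optional", field, reason))
        if u.queries == 0:
            reason = f"'{field}' written {u.writes}x, never queried"
            suggestions.append(("remove_field", field, reason))
    return suggestions


usage = {
    "name": FieldUsage(writes=200, null_writes=0, queries=500),
    "nickname": FieldUsage(writes=200, null_writes=190, queries=12),
    "internal_code": FieldUsage(writes=150, null_writes=0, queries=0),
    "rare": FieldUsage(writes=3, null_writes=3, queries=0),
}
suggestions = suggest(usage)
for kind, field, reason in suggestions:
    print(f"[{kind}] {field}: {reason}")
```

The min_samples guard keeps sparsely observed fields such as rare from generating noisy suggestions.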
Generate migration code
from trails.onto_refine import generate_migration_code
code = generate_migration_code(report.suggestions)
print(code)
Output is a reviewable Python snippet -- no changes are applied without explicit human approval.
Step 6: The admin dashboard for continuous evolution
The trails-admin UI at /admin/ontology ties all three mechanisms
together in one browser interface.
Mounting the dashboard
from fastapi import FastAPI
from trails_admin.routes.ontology import router
app = FastAPI()
app.include_router(router)
# Dashboard available at /admin/ontology
What you see
- Current types -- every registered @node_type with IRI, fields, and inheritance.
- Schema inference -- last run results with a "Run inference" button to trigger a new analysis.
- Refinement suggestions -- active suggestions from the UsageCollector with confidence bars.
- LLM generation -- form for domain-description generation with model selection and cost estimation.
The dashboard is the operator's view of ontology evolution. It replaces the need to remember CLI commands while providing the same functionality.
The progressive story: raw data -> types -> shapes -> OWL
Trails adds structure progressively. At each step, everything from the previous step keeps working:
Step 1: ctx.kg.node(labels=["Note"], properties={...})
→ Labels and edges. No types. No validation. Works.
Step 2: trails onto infer → @node_type declarations
→ Python types with field validation. Old label-first
writes still work (different IRI namespace).
Step 3: @shape on the node types → SHACL validation
→ Cardinality, regex, min/max constraints enforced on write.
Types without shapes still work.
Step 4: Export to OWL → formal ontology for interoperability
→ RDFS/OWL-level axioms for federation and reasoning.
Everything below still works unchanged.
Each step is additive. You never need to rewrite what you already have.
Why this is a paradigm shift
Traditional KG engineering
1. Hire an ontology engineer
2. Run workshops with domain experts (2-4 weeks)
3. Model the ontology in Protege (2-4 weeks)
4. Implement validators against the schema
5. Build ingestion pipeline that matches the schema
6. Discover the schema doesn't match real data
7. Go back to step 2
Cost: Months of upfront investment. Schema changes are expensive. The ontology lags behind the data.
The Trails approach
1. Load your data (any format)
2. Run `trails onto infer` (seconds)
3. Review the suggestions (minutes)
4. Edit the generated code (minutes)
5. Ship
6. `trails onto refine` suggests improvements over time
Cost: Minutes to a working schema. Schema evolves with the data. The ontology is always current.
Key differences
| Aspect | Traditional | Trails |
|---|---|---|
| Starting point | Whiteboard | Data |
| Schema creation | Manual, weeks | Automated, minutes |
| Schema evolution | Committee meetings | trails onto refine suggestions |
| Cost of change | High (remodel + revalidate) | Low (edit Python code) |
| Ontology format | OWL/Protege files | Plain Python @node_type |
| Runtime dependency | Ontology files loaded at boot | None (code is the schema) |
| Who does it | Ontology engineers | Any Python developer |
Full walkthrough: from CSV to production ontology
Here is the complete journey, end to end.
1. Start with a CSV
id,name,dept,salary,hire_date,manager_id
1,Alice,Engineering,95000,2023-01-15,
2,Bob,Marketing,82000,2022-06-01,
3,Carol,Engineering,91000,2023-03-20,1
4,Dave,Marketing,78000,2024-01-10,2
5,Eve,Engineering,105000,2021-09-01,
2. Write an RML mapping
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix ex: <https://myapp.example/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
<#EmployeeMapping>
rml:logicalSource [
rml:source "employees.csv";
rml:referenceFormulation ql:CSV
];
rr:subjectMap [
rr:template "https://myapp.example/Employee/{id}";
rr:class ex:Employee
];
rr:predicateObjectMap [
rr:predicate ex:name;
rr:objectMap [ rml:reference "name" ]
];
rr:predicateObjectMap [
rr:predicate ex:dept;
rr:objectMap [ rml:reference "dept" ]
];
rr:predicateObjectMap [
rr:predicate ex:salary;
rr:objectMap [ rml:reference "salary"; rr:datatype xsd:integer ]
];
rr:predicateObjectMap [
rr:predicate ex:hire_date;
rr:objectMap [ rml:reference "hire_date"; rr:datatype xsd:date ]
].
3. Load the data
from trails.rml import run_mapping
from trails.testing import fresh_context
ctx = fresh_context()
result = run_mapping(ctx, "mappings/employees.ttl",
employees="employees.csv")
print(f"Loaded {result.triples_added} triples in {result.duration_ms:.0f}ms")
# e.g. Loaded 25 triples in 45ms (5 rows x 5 triples: rdf:type plus 4 mapped columns)
4. Infer the schema
Running trails onto infer against the loaded graph produces:
"""Auto-generated @node_type declarations.
Source: trails onto infer (deterministic)
"""
from __future__ import annotations
import datetime
from trails.orm import node_type
# 5 instance(s), confidence: 0.92
# source rdf:type: https://myapp.example/Employee
@node_type("Employee", fields={
"name": str, # confidence: 1.00 | e.g. 'Alice', 'Bob', 'Carol'
"dept": str, # confidence: 1.00 | e.g. 'Engineering', 'Marketing'
"salary": int, # confidence: 1.00 | e.g. '95000', '82000'
"hire_date": str, # confidence: 1.00 | e.g. '2023-01-15'
})
class Employee:
pass
5. Review and improve
import datetime
from trails.orm import node_type
@node_type("Employee", fields={
"name": str,
"department": str, # renamed from "dept" for clarity
"salary": int,
"hire_date": datetime.date, # upgraded from str; the column holds xsd:date values
"manager": "Employee | None", # added from domain knowledge
})
class Employee: ...
6. Use in capabilities
from trails import capability
# Employee is the reviewed @node_type class from step 5.
@capability
def top_earners(ctx, min_salary: int = 90000) -> list:
    hits = Employee.where(salary__gte=min_salary).fetch(ctx)
    return [
        {"name": e.name, "department": e.department, "salary": e.salary}
        for e in hits
    ]
@capability
def team_report(ctx, department: str) -> dict:
    members = Employee.where(department=department).fetch(ctx)
    total = sum(m.salary for m in members)
    return {
        "department": department,
        "headcount": len(members),
        "total_salary": total,
        "members": [{"name": m.name, "salary": m.salary} for m in members],
    }
7. Let it refine over time
from trails.onto_refine import UsageCollector, SchemaAnalyzer
collector = UsageCollector()
collector.start()
# App runs for a week...
analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(min_samples=50)
for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason}")
# Example output:
# [add_index] Employee.department: Fields [name, department]
# co-queried in 87% of cases — consider a composite index.
# [make_optional] Employee.manager: Field 'manager' is null in
# 92% of 500 writes — consider making it optional.
8. Monitor in the dashboard
pip install 'trails[admin]'
trails-admin --app myapp.main
# Open http://127.0.0.1:4455/admin/ontology
See also
- Auto-Ontology Guide -- full API reference for all four phases
- ActiveGraph ORM -- @node_type, the output format of inference
- Shapes & Validation -- @shape for SHACL constraints
- RML Data Mapping -- declarative data loading
- Admin UI -- the operator dashboard
- Admin UI -- the operator dashboard