Auto-Ontology
Traditional knowledge-graph development is schema-first: design the
ontology, build the store, then validate incoming data against the
schema. Trails flips this. The trails.onto_infer module analyses data
already in your store and infers candidate @node_type declarations —
deterministic, zero LLM cost, pure SPARQL statistical analysis. The full
design lives in ADR-0025; the progressive-enhancement framing that keeps
auto-ontology additive is in ADR-0021.
The paradigm shift: data-first
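The classic workflow is: design the ontology, build the store, then validate incoming data against the schema.
The Trails workflow is: populate the graph, let the framework suggest the schema, then review the generated code.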
Instead of spending weeks in ontology workshops before having any data,
you populate the graph first. The framework observes the data and
suggests the schema. Generated @node_type code is indistinguishable
from hand-written code — plain Python, no hidden state, no runtime
dependency on the generator.
Three mechanisms and an admin dashboard support this workflow, adopted progressively:
| Phase | Mechanism | Cost | Status |
|---|---|---|---|
| 1 | `trails onto infer`: statistical schema inference | Free (SPARQL) | Implemented |
| 2 | `trails onto generate`: LLM-assisted generation | LLM tokens | Implemented |
| 3 | `trails onto refine`: usage-driven refinement | Free (log analysis) | Implemented |
| 4 | Admin dashboard: ontology evolution UI | Free | Implemented |
trails onto infer: schema inference from existing data
How it works
The inference engine runs four analysis passes over the store:
- Type discovery. Find all distinct `rdf:type` values and their instance counts. Each type with enough instances becomes a candidate node type.
- Predicate clustering. For each candidate type, collect every predicate used by its instances (excluding metadata predicates like `rdf:type`, `rdfs:label`, `rdfs:comment`). When no explicit `rdf:type` exists, subjects are clustered by Jaccard similarity of their predicate sets: entities that share >= 80% of their predicates are grouped as candidates.
- Type inference. For each predicate, examine all values across instances:
    - XSD datatype annotations are respected first (`xsd:integer` maps to `int`, `xsd:dateTime` to `datetime`, etc.).
    - When no annotation exists, heuristics apply: all-integer values → `int`, all-float → `float`, all-boolean → `bool`, all-IRI → reference.
    - Values that consistently point to instances of another candidate type are detected as reference fields.
- Cardinality inference. Count values per subject per predicate:
    - If every instance has at least one value → `required=True`.
    - If any instance has more than one value → `list[T]`.
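The heuristics in the passes above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `trails.onto_infer` implementation: the function names (`jaccard`, `infer_python_type`, `infer_cardinality`) and the input shapes are hypothetical, values are treated as untyped strings, and IRI/reference detection is left out.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def infer_python_type(values: list[str]) -> type:
    """Heuristic for values that carry no XSD datatype annotation."""
    def parses_as(cast) -> bool:
        try:
            for v in values:
                cast(v)
        except ValueError:
            return False
        return True

    if not values:
        return str
    if all(v.lower() in ("true", "false") for v in values):
        return bool
    if parses_as(int):
        return int
    if parses_as(float):
        return float
    return str


def infer_cardinality(values_by_subject: dict[str, list[str]],
                      instance_count: int) -> tuple[bool, bool]:
    """Return (required, is_list) for one predicate of one candidate type."""
    with_value = sum(1 for vals in values_by_subject.values() if vals)
    required = with_value == instance_count
    is_list = any(len(vals) > 1 for vals in values_by_subject.values())
    return required, is_list


# Two subjects sharing 4 of 5 distinct predicates: similarity 4/5 = 0.8.
a = {"ex:name", "ex:salary", "ex:department", "ex:manager", "ex:email"}
b = {"ex:name", "ex:salary", "ex:department", "ex:manager"}
print(jaccard(a, b) >= 0.8)                   # True
print(infer_python_type(["95000", "82000"]))  # <class 'int'>
print(infer_cardinality({"s1": ["x"], "s2": ["y", "z"]}, instance_count=2))  # (True, True)
```

In this sketch a predicate is marked required only when every instance has a value, and it becomes a list as soon as one subject carries more than one value, mirroring the two cardinality rules.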
CLI usage
# Infer from the default project store (prints to stdout)
trails onto infer
# Write output to a Python module
trails onto infer -o models/inferred.py
# Require at least 5 instances per candidate type
trails onto infer --min-instances 5
# Set minimum confidence threshold
trails onto infer --min-confidence 0.8
Python API
from trails.onto_infer import infer_schema, generate_code

# Infer schema from a store
schema = infer_schema(
    ctx.kg._store,
    trace_id="my-analysis",
    min_instances=3,
    confidence=0.8,
)
print(f"Analyzed {schema.total_triples_analyzed} triples")
print(f"Found {len(schema.candidates)} candidate types")

for candidate in schema.candidates:
    print(f"\n{candidate.name} ({candidate.instance_count} instances, "
          f"confidence: {candidate.confidence:.2f})")
    for prop in candidate.properties:
        marker = "required" if prop.required else "optional"
        print(f"  {prop.name}: {prop.python_type} ({marker}, "
              f"confidence: {prop.confidence:.2f})")

# Generate Python code from the schema
code = generate_code(schema)
print(code)
Reading the output
The generated code includes confidence annotations so you know which declarations to trust and which to review:
"""Auto-generated @node_type declarations.
Source: trails onto infer (deterministic)
"""
from __future__ import annotations

import datetime

from trails.orm import node_type

# 47 instance(s), confidence: 0.94
# source rdf:type: https://myapp.example/Employee
@node_type("Employee", fields={
    "name": str,        # confidence: 1.00 | e.g. 'Alice', 'Bob', 'Carol'
    "department": str,  # confidence: 1.00 | e.g. 'Engineering', 'Marketing'
    "salary": int,      # confidence: 0.96 | e.g. '95000', '82000'
    "manager": str,     # TODO: verify | confidence: 0.72 | e.g. 'https://...'
})
class Employee:
    pass
Key markers to look for:
| Marker | Meaning |
|---|---|
| `confidence: 1.00` | Every instance has this field (high trust) |
| `confidence: 0.72` | Only 72% of instances have this field; review it |
| `TODO: verify` | Confidence below 0.8; the field may be noise |
| `source rdf:type: <iri>` | The RDF type IRI the candidate was derived from |
| Instance count | More instances = more reliable inference |
JSON output
The InferredSchema can be serialized for programmatic consumption:
{
  "candidates": [
    {
      "name": "Employee",
      "source_type_iri": "https://myapp.example/Employee",
      "properties": [
        {
          "name": "name",
          "predicate_iri": "https://myapp.example/name",
          "python_type": "str",
          "required": true,
          "is_list": false,
          "is_ref": false,
          "ref_target": null,
          "confidence": 1.0,
          "sample_values": ["Alice", "Bob", "Carol"]
        }
      ],
      "instance_count": 47,
      "confidence": 0.94
    }
  ],
  "total_triples_analyzed": 1234,
  "total_subjects_analyzed": 89
}
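Because the serialized schema is plain JSON, it can be post-processed with the standard library alone. The sketch below lists properties whose confidence falls under the 0.8 review threshold. The payload is a trimmed copy of the example above; the `manager` property is added here purely for illustration and is an assumption, not actual tool output.

```python
import json

# Trimmed copy of the serialized schema, plus one hypothetical
# low-confidence property ("manager") to give the filter something to find.
payload = """
{
  "candidates": [
    {
      "name": "Employee",
      "properties": [
        {"name": "name", "python_type": "str", "confidence": 1.0},
        {"name": "manager", "python_type": "str", "confidence": 0.72}
      ],
      "instance_count": 47,
      "confidence": 0.94
    }
  ]
}
"""

REVIEW_THRESHOLD = 0.8  # same cut-off as the TODO: verify marker

schema = json.loads(payload)
needs_review = {
    candidate["name"]: [
        prop["name"]
        for prop in candidate["properties"]
        if prop["confidence"] < REVIEW_THRESHOLD
    ]
    for candidate in schema["candidates"]
}
print(needs_review)  # {'Employee': ['manager']}
```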
trails onto generate: LLM-assisted generation (Phase 2)
Describe your domain in plain text and the framework produces
@node_type code via an LLM. The output is indistinguishable from
hand-written code — no magic, no hidden state, no runtime dependency
on the generator.
CLI usage
# From a file
trails onto generate --from domain.txt
# Specify model tier
trails onto generate --model haiku # cheap, structural scaffolding (~$0.001)
trails onto generate --model sonnet # domain-aware nuance (~$0.01)
# Dry run: show estimated cost without calling the LLM
trails onto generate --from domain.txt --dry-run
# Write output to a Python module
trails onto generate --from domain.txt -o models/generated.py
# Iterative refinement on existing code
trails onto generate --from domain.txt --refine models/v1.py
Python API
from trails.onto_generate import generate_schema, refine_schema, estimate_cost

# Estimate cost before spending
est = estimate_cost(
    "Clinical trial management with patients, studies, sites, and adverse events.",
    model="haiku",
)
print(f"Model: {est.model}, ~{est.prompt_tokens} prompt tokens, "
      f"estimated cost: ${est.estimated_cost_usd:.6f}")

# Generate (requires an LLM client configured in trails.toml or passed explicitly)
result = generate_schema(
    "Clinical trial management with patients, studies, sites, "
    "and adverse events. Each study has multiple sites. Patients "
    "enroll at a site. Adverse events link to patients and studies.",
    model="haiku",
    max_types=10,
)
print(f"Generated {len(result.types)} types: {result.types}")
print(f"Cost: ${result.cost_usd:.6f} ({result.model_used})")
print()
print(result.code)
Model selection
| Model | Use case | Approximate cost per call |
|---|---|---|
| `haiku` (default) | Structural scaffolding, simple domains | ~$0.001 |
| `sonnet` | Domain-aware naming, nuanced relationships | ~$0.01 |
| `opus` | Complex multi-domain ontologies | ~$0.10 |
The default is haiku — cheapest adequate model per the framework's
cost-awareness principle. Use sonnet when domain naming quality
matters (e.g. medical, legal).
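To see how the per-call figures add up over an iterative session, the arithmetic can be sketched as follows. The costs are the approximate values from the table above. The helper name `session_cost` is hypothetical, and the sketch assumes that a refinement call costs about the same as a generation call, which the source does not state.

```python
# Approximate per-call costs in USD, copied from the model selection table.
APPROX_COST_PER_CALL = {"haiku": 0.001, "sonnet": 0.01, "opus": 0.10}


def session_cost(model: str, generate_calls: int = 1, refine_calls: int = 0) -> float:
    """Rough session cost, assuming refine calls cost about as much as generate calls."""
    return APPROX_COST_PER_CALL[model] * (generate_calls + refine_calls)


# One generation followed by four refinement rounds, per model tier.
for model in APPROX_COST_PER_CALL:
    print(f"{model:>6}: ~${session_cost(model, refine_calls=4):.3f}")
```

Under these assumptions, five `haiku` calls come to roughly half a cent and five `sonnet` calls to about five cents. For a figure based on the actual prompt, use `estimate_cost` or `--dry-run`.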
Iterative refinement
Refine existing generated code with natural-language feedback:
from trails.onto_generate import refine_schema

refined = refine_schema(
    code=result.code,
    feedback="Add a Randomization type linking Patient to StudyArm. "
             "Make patient.email optional.",
    model="haiku",
)
print(refined.code)
Output format
Every generated file carries a header identifying its source:
"""Auto-generated @node_type declarations.
Source: trails onto generate (LLM: claude-haiku-4-5, 2026-04-17)
Domain: Clinical trial management with patients, studies, sites, and adverse...
"""
from __future__ import annotations

import datetime

from trails.orm import node_type

@node_type("Site", fields={
    "name": str,
    "address": str,
    "principal_investigator": str,
})
class Site:
    """A clinical trial site where patients are enrolled."""
    pass

# ... more types ...
The code is plain Python. Edit it, commit it, import it — no runtime dependency on the generator.
trails onto refine: usage-driven refinement (Phase 3)
The SchemaAnalyzer observes runtime patterns — field writes, query
access, SHACL validation failures, null rates — and suggests schema
improvements. Deterministic analysis, zero LLM cost.
CLI usage
# Analyse usage patterns and print suggestions
trails onto refine
# Include onto-refine section in the doctor report
trails doctor
# Generate migration code for accepted suggestions
trails onto refine --apply
Architecture: UsageCollector + SchemaAnalyzer
Two components work together:
- `UsageCollector`: a lightweight observer hook that accumulates statistics from `kg_write`, `kg_query`, and `shacl_violation` events. Does O(1) work per event. Thread-safe.
- `SchemaAnalyzer`: runs a single pass over the accumulated counters and produces a `RefineReport` with actionable `Suggestion` objects.
from trails.onto_refine import UsageCollector, SchemaAnalyzer

# Start collecting usage data (register as observability hook)
collector = UsageCollector()
collector.start()

# ... your app runs, writes data, queries data ...

# Later: analyse and get suggestions
analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(
    min_samples=10,            # skip types with fewer than 10 writes
    null_threshold=0.90,       # suggest optional if >90% null
    violation_threshold=0.10,  # flag constraints violated >10%
    co_query_threshold=0.80,   # suggest index if co-queried >80%
)

for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason} "
          f"(confidence: {s.confidence:.2f}, samples: {s.sample_count})")

# Stop collecting when done
collector.stop()
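To make the collector's "constant work per event, thread-safe" description concrete, here is one way such an observer could be structured. `TinyUsageCollector` is a hypothetical stand-in written for this page, not the real `UsageCollector`. It only borrows the `record_write` and `record_query` method names from the API above, and its internal counters are assumptions.

```python
import threading
from collections import Counter


class TinyUsageCollector:
    """Hypothetical stand-in for UsageCollector: lock-protected counters only."""

    def __init__(self):
        self._lock = threading.Lock()
        self.writes = Counter()        # node_type -> number of writes
        self.field_writes = Counter()  # (node_type, field) -> writes including the field
        self.nulls = Counter()         # (node_type, field) -> writes where the field was null
        self.queries = Counter()       # (node_type, field) -> queries touching the field

    def record_write(self, node_type, fields, null_fields=()):
        # Work is proportional to the fields in this one event, not to history.
        with self._lock:
            self.writes[node_type] += 1
            for field in fields:
                self.field_writes[(node_type, field)] += 1
            for field in null_fields:
                self.nulls[(node_type, field)] += 1

    def record_query(self, node_type, fields):
        with self._lock:
            for field in fields:
                self.queries[(node_type, field)] += 1


collector = TinyUsageCollector()


def worker():
    for _ in range(1000):
        collector.record_write("Employee", ["name"], null_fields=["salary"])


# Four threads writing concurrently; the lock keeps the counts exact.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(collector.writes["Employee"])  # 4000
```

Keeping only counters (rather than a log of events) is what lets the later analysis run as a single pass over a small amount of state.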
Suggestion types
| Kind | Trigger | Example |
|---|---|---|
| `make_optional` | Field null in >90% of writes | "Field 'nickname' is null in 95% of 200 writes" |
| `remove_field` | Field written but never queried | "Field 'internal_code' written 150 times but never queried" |
| `relax_constraint` | SHACL constraint violated in >10% of writes | "Constraint 'maxLength' on 'name' violated in 12% of writes" |
| `add_index` | Fields queried together in >80% of cases | "Fields [name, department] co-queried in 85% of cases" |
| `add_relationship` | Node types always co-occur in queries | "Types [Patient, Study] co-occur in 50 queries" |
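The first two rows of the table reduce to simple ratio checks over accumulated counters. The sketch below illustrates that idea only: `FieldStats` and `suggest` are hypothetical names, and the real `SchemaAnalyzer` may compute and report its suggestions differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FieldStats:
    """Hypothetical per-field counters; not the SchemaAnalyzer's real data model."""
    writes: int   # writes of the owning node type
    nulls: int    # writes in which this field was null
    queries: int  # queries that read this field


def suggest(stats: dict[str, FieldStats], *, min_samples: int = 10,
            null_threshold: float = 0.90) -> list[tuple[str, str, str]]:
    """Return (kind, field, reason) tuples for the two simplest rules."""
    suggestions = []
    for field, s in stats.items():
        if s.writes < min_samples:
            continue  # too little data to say anything
        null_rate = s.nulls / s.writes
        if null_rate > null_threshold:
            suggestions.append(("make_optional", field,
                                f"null in {null_rate:.0%} of {s.writes} writes"))
        if s.queries == 0:
            suggestions.append(("remove_field", field,
                                f"written {s.writes} times but never queried"))
    return suggestions


stats = {
    "name": FieldStats(writes=200, nulls=0, queries=180),
    "nickname": FieldStats(writes=200, nulls=190, queries=12),
    "internal_code": FieldStats(writes=150, nulls=0, queries=0),
}
for kind, field, reason in suggest(stats):
    print(f"[{kind}] {field}: {reason}")
```

With these sample counters the sketch flags `nickname` (null in 190 of 200 writes, i.e. 95%) and `internal_code` (never queried), and leaves `name` alone.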
Manual event recording (testing)
For testing or external data, record events manually without the observability hook:
collector = UsageCollector()
# Simulate writes
collector.record_write("Employee", fields=["name", "salary"])
collector.record_write("Employee", fields=["name"], null_fields=["salary"])
# Simulate queries
collector.record_query("Employee", fields=["name", "department"])
# Simulate SHACL violations
collector.record_shacl_violation("Employee", "salary", "minInclusive")
# Analyse
analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(min_samples=1)
Migration code generation
Generate reviewable Python migration snippets from accepted suggestions:
from trails.onto_refine import generate_migration_code
code = generate_migration_code(report.suggestions)
print(code)
Output:
"""Migration code generated by `trails onto refine --apply`.
Review each change before applying.
"""
from __future__ import annotations
# --- Suggestion 1: make_optional ---
# Node type: Employee
# Field: nickname
# Reason: Field 'nickname' is null in 95% of 200 writes — consider making it optional.
# Confidence: 0.95
# In your @node_type/Employee definition:
# Change: "nickname": str
# To: "nickname": str | None
# Or in @shape: predicate(..., required=False)
Refinement is progressive and non-blocking. Suggestions accumulate and are surfaced on demand. No schema change is applied without explicit human approval.
Admin dashboard for ontology evolution (Phase 4)
The trails-admin UI includes an ontology dashboard at
/admin/ontology that ties together all three mechanisms (infer,
generate, refine) in a single browser interface.
Routes
| Method | Path | Description |
|---|---|---|
| `GET` | `/admin/ontology` | Overview: current `@node_type` definitions, last inference, active suggestions, generation link |
| `POST` | `/admin/ontology/infer` | Trigger a deterministic inference run against the kernel store |
| `GET` | `/admin/ontology/generate` | Form for LLM-assisted generation with model selection and cost estimation |
| `POST` | `/admin/ontology/generate` | Submit a generation request (estimate or generate) |
| `GET` | `/admin/ontology/suggestions` | List refinement suggestions with confidence bars |
Overview page
The overview page shows four sections:
- Current `@node_type` definitions: lists all registered types with their IRI, fields, and inheritance.
- Schema Inference: displays the last inference run's results (triple count, candidates, confidence). A "Run inference" button triggers a new run.
- Refinement Suggestions: shows the count of active suggestions from the `UsageCollector` with a link to the full list.
- LLM-Assisted Generation: links to the generation form.
Generation form
The generation form accepts a domain description and model selection (Haiku or Sonnet). Two actions:
- Estimate cost: shows prompt token count and estimated USD without calling the LLM.
- Generate: calls the LLM and displays the generated `@node_type` code inline.
Suggestions page
Displays all refinement suggestions in a table with columns for kind,
node type, field, confidence (rendered as a colored bar), reason, and
sample count. This is the same data as trails onto refine but rendered
in the browser.
Mounting the dashboard
from fastapi import FastAPI
from trails_admin.routes.ontology import router
app = FastAPI()
app.include_router(router)
# Dashboard available at /admin/ontology
The progressive story
Here is the complete workflow from raw data to a mature ontology:
Step 1: Load CSV data
from trails.rml import run_mapping
result = run_mapping(ctx, "mappings/employees.ttl")
print(f"Loaded {result.triples_added} triples")
Step 2: Infer schema
Run trails onto infer -o models/inferred.py against the populated store.
The output contains candidate @node_type declarations with confidence
scores.
Step 3: Review and edit
Open models/inferred.py, review the confidence markers, fix any
TODO: verify fields, remove noise, and adjust types:
# Before (generated):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": str,  # TODO: verify | confidence: 0.72
})
class Employee:
    pass

# After (reviewed):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": "Employee | None",  # optional self-reference
})
class Employee: ...
Step 4: Use the types
from models.inferred import Employee

@capability
def top_earners(ctx, min_salary: int) -> list:
    return [
        {"name": e.name, "salary": e.salary}
        for e in Employee.where(salary__gte=min_salary).fetch(ctx)
    ]
Step 5: Refine over time (Phase 3)
As the app runs, trails onto refine observes write patterns, query
patterns, and validation results and suggests improvements such as making
rarely populated fields optional, relaxing frequently violated
constraints, and flagging fields that are written but never queried:
from trails.onto_refine import UsageCollector, SchemaAnalyzer

collector = UsageCollector()
collector.start()

# ... app runs for a while ...

analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze()
for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason}")
Step 6: Monitor in the dashboard (Phase 4)
Open /admin/ontology to see current types, run inference, review
suggestions, and generate new types from the browser — all without
touching the CLI.
The ontology is a living document, not a static contract.
Reference
| Symbol | Description |
|---|---|
| `infer_schema(store, trace_id, *, min_instances, confidence)` | Phase 1: infer schema from store data; returns `InferredSchema` |
| `generate_code(schema)` | Phase 1: generate Python `@node_type` code from an `InferredSchema` |
| `InferredSchema` | `.candidates`, `.total_triples_analyzed`, `.total_subjects_analyzed`, `.to_json()` |
| `NodeTypeCandidate` | `.name`, `.source_type_iri`, `.properties`, `.instance_count`, `.confidence` |
| `PropertyCandidate` | `.name`, `.predicate_iri`, `.python_type`, `.required`, `.is_list`, `.is_ref`, `.ref_target`, `.confidence`, `.sample_values` |
| `generate_schema(description, *, model, llm_client, max_types)` | Phase 2: LLM-assisted generation; returns `GeneratedSchema` |
| `refine_schema(code, feedback, *, model, llm_client)` | Phase 2: iterative LLM refinement of existing code |
| `estimate_cost(description, *, model)` | Phase 2: pre-execution cost estimate; returns `CostEstimate` |
| `GeneratedSchema` | `.code`, `.types`, `.cost_usd`, `.model_used`, `.raw_response` |
| `CostEstimate` | `.prompt_tokens`, `.estimated_cost_usd`, `.model` |
| `UsageCollector(max_events)` | Phase 3: accumulates usage stats from observability events |
| `UsageCollector.start()` / `.stop()` | Register/unregister the observer hook |
| `UsageCollector.record_write(node_type, fields, null_fields)` | Manual write event recording |
| `UsageCollector.record_query(node_type, fields)` | Manual query event recording |
| `UsageCollector.record_shacl_violation(node_type, field, constraint)` | Manual violation recording |
| `SchemaAnalyzer(collector)` | Phase 3: analyses usage stats and produces suggestions |
| `SchemaAnalyzer.analyze(min_samples, null_threshold, ...)` | Run analysis; returns `RefineReport` |
| `RefineReport` | `.suggestions`, `.stats`, `.analysis_period`, `.to_json()` |
| `Suggestion` | `.kind`, `.node_type`, `.field`, `.reason`, `.confidence`, `.sample_count` |
| `generate_migration_code(suggestions)` | Phase 3: generate reviewable Python migration snippets |
| `/admin/ontology` routes | Phase 4: admin dashboard for ontology evolution (FastAPI router) |
See also
- ADR-0025: full auto-ontology design
- ActiveGraph ORM: `@node_type`, the output format of inference
- Shapes & Validation: `@shape` for SHACL-grade validation
- RML Data Mapping: declarative data loading that feeds inference
- Document Ingestion: code extractors for unstructured text