Auto-Ontology¶

Traditional knowledge-graph development is schema-first: design the ontology, build the store, then validate incoming data against the schema. Trails flips this. The trails.onto_infer module analyses data already in your store and infers candidate @node_type declarations. The inference is deterministic and incurs no LLM cost, because it relies purely on SPARQL statistical analysis. The full design lives in ADR-0025; the progressive-enhancement framing that keeps auto-ontology additive is ADR-0021.

The paradigm shift: data-first¶

The classic workflow is:

design ontology → build store → validate data → ship

The Trails workflow is:

load data → infer types → refine schema → ship

Instead of spending weeks in ontology workshops before having any data, you populate the graph first. The framework observes the data and suggests the schema. Generated @node_type code is indistinguishable from hand-written code — plain Python, no hidden state, no runtime dependency on the generator.

Three mechanisms and an admin dashboard support this workflow, adopted progressively in four phases:

Phase	Mechanism	Cost	Status
1	`trails onto infer` — statistical schema inference	Free (SPARQL)	Implemented
2	`trails onto generate` — LLM-assisted generation	LLM tokens	Implemented
3	`trails onto refine` — usage-driven refinement	Free (log analysis)	Implemented
4	Admin dashboard — ontology evolution UI	Free	Implemented

`trails onto infer` — schema inference from existing data¶

How it works¶

The inference engine runs four analysis passes over the store:

1. Type discovery. Find all distinct rdf:type values and their instance counts. Each type with at least --min-instances instances becomes a candidate node type.
2. Predicate clustering. For each candidate type, collect every predicate used by its instances (excluding metadata predicates like rdf:type, rdfs:label, rdfs:comment). When no explicit rdf:type exists, subjects are clustered by Jaccard similarity of their predicate sets: entities that share >= 80% of their predicates are grouped as candidates.
3. Type inference. For each predicate, examine all values across instances:
   - XSD datatype annotations are respected first (xsd:integer maps to int, xsd:dateTime to datetime, etc.).
   - When no annotation exists, heuristics apply: all-integer values → int, all-float → float, all-boolean → bool, all-IRI → reference.
   - Values that consistently point to instances of another candidate type are detected as reference fields.
4. Cardinality inference. Count values per subject per predicate:
   - If every instance has at least one value → required=True.
   - If any instance has more than one value → list[T].
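
The passes above can be pictured without a store. The following is a minimal sketch in plain Python, not the trails.onto_infer implementation: the helper names (`jaccard`, `infer_python_type`, `infer_cardinality`) and the in-memory data layout are assumptions made for illustration, and the real engine performs the equivalent analysis in SPARQL.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two predicate sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def infer_python_type(values: list[str]) -> str:
    """Heuristic type for un-annotated values (hypothetical helper)."""
    if not values:
        return "str"

    def all_parse(parse) -> bool:
        # True when every value is accepted by the given parser.
        try:
            for value in values:
                parse(value)
            return True
        except ValueError:
            return False

    if all(v.lower() in ("true", "false") for v in values):
        return "bool"
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all(v.startswith(("http://", "https://")) for v in values):
        return "reference"
    return "str"


def infer_cardinality(
    values_per_subject: dict[str, list[str]], subjects: list[str]
) -> tuple[bool, bool]:
    """Return (required, is_list) for one predicate across all instances."""
    counts = [len(values_per_subject.get(s, [])) for s in subjects]
    required = all(c >= 1 for c in counts)  # every instance has a value
    is_list = any(c > 1 for c in counts)    # some instance has several
    return required, is_list


# Two untyped subjects whose predicate sets overlap in 4 of 5 predicates.
predicates_a = {"name", "department", "salary", "manager", "email"}
predicates_b = {"name", "department", "salary", "manager"}
similarity = jaccard(predicates_a, predicates_b)
same_cluster = similarity >= 0.8

salary_type = infer_python_type(["95000", "82000"])
required, is_list = infer_cardinality(
    {"emp1": ["Engineering"], "emp2": ["Marketing", "Sales"]},
    subjects=["emp1", "emp2", "emp3"],
)
print(similarity, same_cluster, salary_type, required, is_list)
```

In this toy data the `department` predicate is missing on one subject and repeated on another, so it would come out as an optional list field.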

CLI usage¶

# Infer from the default project store (prints to stdout)
trails onto infer

# Write output to a Python module
trails onto infer -o models/inferred.py

# Require at least 5 instances per candidate type
trails onto infer --min-instances 5

# Set minimum confidence threshold
trails onto infer --min-confidence 0.8

Python API¶

from trails.onto_infer import infer_schema, generate_code

# Infer schema from a store
schema = infer_schema(
    ctx.kg._store,
    trace_id="my-analysis",
    min_instances=3,
    confidence=0.8,
)

print(f"Analyzed {schema.total_triples_analyzed} triples")
print(f"Found {len(schema.candidates)} candidate types")

for candidate in schema.candidates:
    print(f"\n{candidate.name} ({candidate.instance_count} instances, "
          f"confidence: {candidate.confidence:.2f})")
    for prop in candidate.properties:
        marker = "required" if prop.required else "optional"
        print(f"  {prop.name}: {prop.python_type} ({marker}, "
              f"confidence: {prop.confidence:.2f})")

# Generate Python code from the schema
code = generate_code(schema)
print(code)

Reading the output¶

The generated code includes confidence annotations so you know which declarations to trust and which to review:

"""Auto-generated @node_type declarations.

Source: trails onto infer (deterministic)
"""
from __future__ import annotations

import datetime

from trails.orm import node_type

# 47 instance(s), confidence: 0.94
# source rdf:type: https://myapp.example/Employee
@node_type("Employee", fields={
    "name": str,  # confidence: 1.00 | e.g. 'Alice', 'Bob', 'Carol'
    "department": str,  # confidence: 1.00 | e.g. 'Engineering', 'Marketing'
    "salary": int,  # confidence: 0.96 | e.g. '95000', '82000'
    "manager": str,  # TODO: verify | confidence: 0.72 | e.g. 'https://...'
})
class Employee:
    pass

Key markers to look for:

Marker	Meaning
`confidence: 1.00`	Every instance has this field — high trust
`confidence: 0.72`	Only 72% of instances have this field — review it
`TODO: verify`	Confidence below 0.8 — the field may be noise
`source rdf:type: <iri>`	The RDF type IRI the candidate was derived from
Instance count	More instances = more reliable inference
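
One way to read these markers is as a coverage ratio. The sketch below assumes, based on the table above, that a field's confidence is the fraction of instances carrying the field and that the `TODO: verify` marker is added below 0.8. The function names are hypothetical, and the real scoring in trails.onto_infer may weigh other signals.

```python
def field_confidence(instances_with_field: int, total_instances: int) -> float:
    """Fraction of instances that carry the field (assumed scoring rule)."""
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    return instances_with_field / total_instances


def annotate(field: str, python_type: str, confidence: float,
             threshold: float = 0.8) -> str:
    """Render one field line in the style of the generated code."""
    todo = "TODO: verify | " if confidence < threshold else ""
    return f'"{field}": {python_type},  # {todo}confidence: {confidence:.2f}'


# 47 Employee instances; 34 of them have a manager.
line_name = annotate("name", "str", field_confidence(47, 47))
line_manager = annotate("manager", "str", field_confidence(34, 47))
print(line_name)
print(line_manager)
```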

JSON output¶

The InferredSchema can be serialized for programmatic consumption:

schema = infer_schema(ctx.kg._store)
print(schema.to_json())

{
  "candidates": [
    {
      "name": "Employee",
      "source_type_iri": "https://myapp.example/Employee",
      "properties": [
        {
          "name": "name",
          "predicate_iri": "https://myapp.example/name",
          "python_type": "str",
          "required": true,
          "is_list": false,
          "is_ref": false,
          "ref_target": null,
          "confidence": 1.0,
          "sample_values": ["Alice", "Bob", "Carol"]
        }
      ],
      "instance_count": 47,
      "confidence": 0.94
    }
  ],
  "total_triples_analyzed": 1234,
  "total_subjects_analyzed": 89
}
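
Because the serialized form is plain JSON, review tooling needs nothing beyond the standard library. A minimal sketch, assuming the key names shown in the sample above and a review threshold of 0.8; the `needs_review` helper and the trimmed `SAMPLE` document are illustrative and not part of Trails:

```python
import json

# Trimmed stand-in for the output of schema.to_json().
SAMPLE = """
{
  "candidates": [
    {
      "name": "Employee",
      "properties": [
        {"name": "name", "python_type": "str", "confidence": 1.0},
        {"name": "manager", "python_type": "str", "confidence": 0.72}
      ],
      "instance_count": 47,
      "confidence": 0.94
    }
  ]
}
"""


def needs_review(schema_json: str, threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """List (type, field, confidence) for every property below the threshold."""
    schema = json.loads(schema_json)
    flagged = []
    for candidate in schema["candidates"]:
        for prop in candidate["properties"]:
            if prop["confidence"] < threshold:
                flagged.append((candidate["name"], prop["name"], prop["confidence"]))
    return flagged


flagged = needs_review(SAMPLE)
for type_name, field, confidence in flagged:
    print(f"{type_name}.{field}: confidence {confidence:.2f}")
```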

`trails onto generate` — LLM-assisted generation (Phase 2)¶

Describe your domain in plain text and the framework produces @node_type code via an LLM. The output is indistinguishable from hand-written code — no magic, no hidden state, no runtime dependency on the generator.

CLI usage¶

# From a file
trails onto generate --from domain.txt

# Specify model tier
trails onto generate --model haiku   # cheap, structural scaffolding (~$0.001)
trails onto generate --model sonnet  # domain-aware nuance (~$0.01)

# Dry run: show estimated cost without calling the LLM
trails onto generate --from domain.txt --dry-run

# Write output to a Python module
trails onto generate --from domain.txt -o models/generated.py

# Iterative refinement on existing code
trails onto generate --from domain.txt --refine models/v1.py

Python API¶

from trails.onto_generate import generate_schema, refine_schema, estimate_cost

# Estimate cost before spending
est = estimate_cost(
    "Clinical trial management with patients, studies, sites, and adverse events.",
    model="haiku",
)
print(f"Model: {est.model}, ~{est.prompt_tokens} prompt tokens, "
      f"estimated cost: ${est.estimated_cost_usd:.6f}")

# Generate (requires an LLM client configured in trails.toml or passed explicitly)
result = generate_schema(
    "Clinical trial management with patients, studies, sites, "
    "and adverse events. Each study has multiple sites. Patients "
    "enroll at a site. Adverse events link to patients and studies.",
    model="haiku",
    max_types=10,
)

print(f"Generated {len(result.types)} types: {result.types}")
print(f"Cost: ${result.cost_usd:.6f} ({result.model_used})")
print()
print(result.code)

Model selection¶

Model	Use case	Approximate cost per call
`haiku` (default)	Structural scaffolding, simple domains	~$0.001
`sonnet`	Domain-aware naming, nuanced relationships	~$0.01
`opus`	Complex multi-domain ontologies	~$0.10

The default is haiku — cheapest adequate model per the framework's cost-awareness principle. Use sonnet when domain naming quality matters (e.g. medical, legal).
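
If you want to enforce a per-call budget in your own tooling, the table can be turned into a small lookup. This is a sketch under assumptions: the `pick_model` helper is hypothetical, the costs are the approximate figures from the table rather than live pricing, and it treats a higher price as a proxy for a more capable tier.

```python
# Approximate per-call costs copied from the table above (USD).
APPROX_COST_USD = {"haiku": 0.001, "sonnet": 0.01, "opus": 0.10}


def pick_model(budget_usd: float, preferred: str = "haiku") -> str:
    """Return `preferred` if it fits the budget, else the priciest tier that does."""
    if preferred not in APPROX_COST_USD:
        raise ValueError(f"unknown model tier: {preferred!r}")
    if APPROX_COST_USD[preferred] <= budget_usd:
        return preferred
    affordable = [m for m, cost in APPROX_COST_USD.items() if cost <= budget_usd]
    if not affordable:
        raise ValueError("budget is below the cheapest tier")
    return max(affordable, key=APPROX_COST_USD.get)


default_choice = pick_model(0.05)            # default tier fits the budget
downgraded = pick_model(0.05, "opus")        # opus does not fit, fall back
print(default_choice, downgraded)
```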

Iterative refinement¶

Refine existing generated code with natural-language feedback:

from trails.onto_generate import refine_schema

refined = refine_schema(
    code=result.code,
    feedback="Add a Randomization type linking Patient to StudyArm. "
             "Make patient.email optional.",
    model="haiku",
)
print(refined.code)

Output format¶

Every generated file carries a header identifying its source:

"""Auto-generated @node_type declarations.

Source: trails onto generate (LLM: claude-haiku-4-5, 2026-04-17)
Domain: Clinical trial management with patients, studies, sites, and adverse...
"""
from __future__ import annotations

import datetime

from trails.orm import node_type

@node_type("Site", fields={
    "name": str,
    "address": str,
    "principal_investigator": str,
})
class Site:
    """A clinical trial site where patients are enrolled."""
    pass

# ... more types ...

The code is plain Python. Edit it, commit it, import it — no runtime dependency on the generator.

`trails onto refine` — usage-driven refinement (Phase 3)¶

The UsageCollector observes runtime patterns (field writes, query access, SHACL validation failures, null rates), and the SchemaAnalyzer turns them into suggested schema improvements. The analysis is deterministic and incurs no LLM cost.

CLI usage¶

# Analyse usage patterns and print suggestions
trails onto refine

# Include onto-refine section in the doctor report
trails doctor

# Generate migration code for accepted suggestions
trails onto refine --apply

Architecture: UsageCollector + SchemaAnalyzer¶

Two components work together:

UsageCollector — a lightweight observer hook that accumulates statistics from kg_write, kg_query, and shacl_violation events. Does O(1) work per event. Thread-safe.
SchemaAnalyzer — runs a single pass over the accumulated counters and produces a RefineReport with actionable Suggestion objects.

from trails.onto_refine import UsageCollector, SchemaAnalyzer

# Start collecting usage data (register as observability hook)
collector = UsageCollector()
collector.start()

# ... your app runs, writes data, queries data ...

# Later: analyse and get suggestions
analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(
    min_samples=10,        # skip types with fewer than 10 writes
    null_threshold=0.90,   # suggest optional if >90% null
    violation_threshold=0.10,  # flag constraints violated >10%
    co_query_threshold=0.80,   # suggest index if co-queried >80%
)

for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason} "
          f"(confidence: {s.confidence:.2f}, samples: {s.sample_count})")

# Stop collecting when done
collector.stop()
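
The "O(1) work per event, thread-safe" property can be pictured as a lock-guarded counter update. The class below is a toy stand-in written for illustration; `TinyCollector` is hypothetical and says nothing about how UsageCollector is implemented internally.

```python
import threading
from collections import Counter


class TinyCollector:
    """Toy collector: each event only increments counters under a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._writes = Counter()  # keyed by (node_type, field)

    def record_write(self, node_type: str, fields: list[str]) -> None:
        # Cost depends only on the size of this event, not on history.
        with self._lock:
            for field in fields:
                self._writes[(node_type, field)] += 1

    def write_count(self, node_type: str, field: str) -> int:
        with self._lock:
            return self._writes[(node_type, field)]


collector = TinyCollector()


def worker() -> None:
    for _ in range(1000):
        collector.record_write("Employee", ["name", "salary"])


# Four threads record events concurrently; the lock keeps counts exact.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = collector.write_count("Employee", "name")
print(total)
```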

Suggestion types¶

Kind	Trigger	Example
`make_optional`	Field null in >90% of writes	"Field 'nickname' is null in 95% of 200 writes"
`remove_field`	Field written but never queried	"Field 'internal_code' written 150 times but never queried"
`relax_constraint`	SHACL constraint violated in >10% of writes	"Constraint 'maxLength' on 'name' violated in 12% of writes"
`add_index`	Fields queried together in >80% of cases	"Fields [name, department] co-queried in 85% of cases"
`add_relationship`	Node types always co-occur in queries	"Types [Patient, Study] co-occur in 50 queries"
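
To make the triggers concrete, the following toy analyzer derives two of the suggestion kinds from plain counters. It is a sketch under assumptions, not the SchemaAnalyzer implementation: the `FieldStats` class and `suggest` function are hypothetical, the default thresholds mirror the table, and for brevity the `min_samples` check is applied per field rather than per type.

```python
from dataclasses import dataclass


@dataclass
class FieldStats:
    writes: int = 0
    null_writes: int = 0
    queries: int = 0


def suggest(node_type: str, stats: dict[str, FieldStats], *,
            min_samples: int = 10, null_threshold: float = 0.90):
    """Return (kind, target, reason) tuples for make_optional and remove_field."""
    suggestions = []
    for field, s in stats.items():
        if s.writes < min_samples:
            continue  # too little data to say anything
        target = f"{node_type}.{field}"
        null_rate = s.null_writes / s.writes
        if null_rate > null_threshold:
            suggestions.append(
                ("make_optional", target,
                 f"null in {null_rate:.0%} of {s.writes} writes"))
        if s.queries == 0:
            suggestions.append(
                ("remove_field", target,
                 f"written {s.writes} times but never queried"))
    return suggestions


stats = {
    "nickname": FieldStats(writes=200, null_writes=190, queries=12),
    "internal_code": FieldStats(writes=150, null_writes=0, queries=0),
    "name": FieldStats(writes=200, null_writes=0, queries=180),
}
suggestions = suggest("Employee", stats)
for kind, target, reason in suggestions:
    print(f"[{kind}] {target}: {reason}")
```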

Manual event recording (testing)¶

For testing or external data, record events manually without the observability hook:

collector = UsageCollector()

# Simulate writes
collector.record_write("Employee", fields=["name", "salary"])
collector.record_write("Employee", fields=["name"], null_fields=["salary"])

# Simulate queries
collector.record_query("Employee", fields=["name", "department"])

# Simulate SHACL violations
collector.record_shacl_violation("Employee", "salary", "minInclusive")

# Analyse
analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(min_samples=1)

Migration code generation¶

Generate reviewable Python migration snippets from accepted suggestions:

from trails.onto_refine import generate_migration_code

code = generate_migration_code(report.suggestions)
print(code)

Output:

"""Migration code generated by `trails onto refine --apply`.

Review each change before applying.
"""
from __future__ import annotations

# --- Suggestion 1: make_optional ---
# Node type: Employee
# Field: nickname
# Reason: Field 'nickname' is null in 95% of 200 writes — consider making it optional.
# Confidence: 0.95

# In your @node_type/Employee definition:
# Change: "nickname": str
# To:     "nickname": str | None
# Or in @shape: predicate(..., required=False)

Refinement is progressive and non-blocking. Suggestions accumulate and are surfaced on demand. No schema change is applied without explicit human approval.

Admin dashboard for ontology evolution (Phase 4)¶

The trails-admin UI includes an ontology dashboard at /admin/ontology that ties together all three mechanisms (infer, generate, refine) in a single browser interface.

Routes¶

Method	Path	Description
`GET`	`/admin/ontology`	Overview: current `@node_type` definitions, last inference, active suggestions, generation link
`POST`	`/admin/ontology/infer`	Trigger a deterministic inference run against the kernel store
`GET`	`/admin/ontology/generate`	Form for LLM-assisted generation with model selection and cost estimation
`POST`	`/admin/ontology/generate`	Submit a generation request (estimate or generate)
`GET`	`/admin/ontology/suggestions`	List refinement suggestions with confidence bars

Overview page¶

The overview page shows four sections:

Current @node_type definitions — lists all registered types with their IRI, fields, and inheritance.
Schema Inference — displays the last inference run's results (triple count, candidates, confidence). A "Run inference" button triggers a new run.
Refinement Suggestions — shows the count of active suggestions from the UsageCollector with a link to the full list.
LLM-Assisted Generation — links to the generation form.

Generation form¶

The generation form accepts a domain description and model selection (Haiku or Sonnet). Two actions:

Estimate cost — shows prompt token count and estimated USD without calling the LLM.
Generate — calls the LLM and displays the generated @node_type code inline.

Suggestions page¶

Displays all refinement suggestions in a table with columns for kind, node type, field, confidence (rendered as a colored bar), reason, and sample count. This is the same data as trails onto refine but rendered in the browser.

Mounting the dashboard¶

from fastapi import FastAPI
from trails_admin.routes.ontology import router

app = FastAPI()
app.include_router(router)
# Dashboard available at /admin/ontology

The progressive story¶

Here is the complete workflow from raw data to a mature ontology:

Step 1: Load CSV data¶

from trails.rml import run_mapping

result = run_mapping(ctx, "mappings/employees.ttl")
print(f"Loaded {result.triples_added} triples")

Step 2: Infer schema¶

trails onto infer -o models/inferred.py

The output contains candidate @node_type declarations with confidence scores.

Step 3: Review and edit¶

Open models/inferred.py, review the confidence markers, fix any TODO: verify fields, remove noise, and adjust types:

# Before (generated):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": str,  # TODO: verify | confidence: 0.72
})
class Employee:
    pass

# After (reviewed):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": "Employee | None",  # optional self-reference
})
class Employee: ...

Step 4: Use the types¶

from models.inferred import Employee

# Assumes the Trails `capability` decorator is already imported in this module.
@capability
def top_earners(ctx, min_salary: int) -> list:
    return [
        {"name": e.name, "salary": e.salary}
        for e in Employee.where(salary__gte=min_salary).fetch(ctx)
    ]

Step 5: Refine over time (Phase 3)¶

As the app runs, a UsageCollector records writes, queries, and SHACL validation failures, and trails onto refine turns them into suggestions such as making rarely populated fields optional, relaxing frequently violated constraints, and flagging fields that are never queried:

from trails.onto_refine import UsageCollector, SchemaAnalyzer

collector = UsageCollector()
collector.start()

# ... app runs for a while ...

analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze()
for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason}")

Step 6: Monitor in the dashboard (Phase 4)¶

Open /admin/ontology to see current types, run inference, review suggestions, and generate new types from the browser — all without touching the CLI.

The ontology is a living document, not a static contract.

Reference¶

Symbol	Description
`infer_schema(store, trace_id, *, min_instances, confidence)`	Phase 1: infer schema from store data; returns `InferredSchema`
`generate_code(schema)`	Phase 1: generate Python `@node_type` code from an `InferredSchema`
`InferredSchema`	`.candidates`, `.total_triples_analyzed`, `.total_subjects_analyzed`, `.to_json()`
`NodeTypeCandidate`	`.name`, `.source_type_iri`, `.properties`, `.instance_count`, `.confidence`
`PropertyCandidate`	`.name`, `.predicate_iri`, `.python_type`, `.required`, `.is_list`, `.is_ref`, `.ref_target`, `.confidence`, `.sample_values`
`generate_schema(description, *, model, llm_client, max_types)`	Phase 2: LLM-assisted generation; returns `GeneratedSchema`
`refine_schema(code, feedback, *, model, llm_client)`	Phase 2: iterative LLM refinement of existing code
`estimate_cost(description, *, model)`	Phase 2: pre-execution cost estimate; returns `CostEstimate`
`GeneratedSchema`	`.code`, `.types`, `.cost_usd`, `.model_used`, `.raw_response`
`CostEstimate`	`.prompt_tokens`, `.estimated_cost_usd`, `.model`
`UsageCollector(max_events)`	Phase 3: accumulates usage stats from observability events
`UsageCollector.start()` / `.stop()`	Register/unregister the observer hook
`UsageCollector.record_write(node_type, fields, null_fields)`	Manual write event recording
`UsageCollector.record_query(node_type, fields)`	Manual query event recording
`UsageCollector.record_shacl_violation(node_type, field, constraint)`	Manual violation recording
`SchemaAnalyzer(collector)`	Phase 3: analyses usage stats and produces suggestions
`SchemaAnalyzer.analyze(min_samples, null_threshold, ...)`	Run analysis; returns `RefineReport`
`RefineReport`	`.suggestions`, `.stats`, `.analysis_period`, `.to_json()`
`Suggestion`	`.kind`, `.node_type`, `.field`, `.reason`, `.confidence`, `.sample_count`
`generate_migration_code(suggestions)`	Phase 3: generate reviewable Python migration snippets
`/admin/ontology` routes	Phase 4: admin dashboard for ontology evolution (FastAPI router)
