Auto-Ontology

Traditional knowledge-graph development is schema-first: design the ontology, build the store, then validate incoming data against the schema. Trails flips this. The trails.onto_infer module analyses data already in your store and infers candidate @node_type declarations — deterministic, zero LLM cost, pure SPARQL statistical analysis. The full design lives in ADR-0025; the progressive-enhancement framing that keeps auto-ontology additive is ADR-0021.

The paradigm shift: data-first

The classic workflow is:

design ontology → build store → validate data → ship

The Trails workflow is:

load data → infer types → refine schema → ship

Instead of spending weeks in ontology workshops before having any data, you populate the graph first. The framework observes the data and suggests the schema. Generated @node_type code is indistinguishable from hand-written code — plain Python, no hidden state, no runtime dependency on the generator.

Three mechanisms support this workflow, adopted progressively, and an admin dashboard ties them together:

| Phase | Mechanism | Cost | Status |
|-------|-----------|------|--------|
| 1 | trails onto infer: statistical schema inference | Free (SPARQL) | Implemented |
| 2 | trails onto generate: LLM-assisted generation | LLM tokens | Implemented |
| 3 | trails onto refine: usage-driven refinement | Free (log analysis) | Implemented |
| 4 | Admin dashboard: ontology evolution UI | Free | Implemented |

trails onto infer — schema inference from existing data

How it works

The inference engine runs four analysis passes over the store:

  1. Type discovery. Find all distinct rdf:type values and their instance counts. Each type with enough instances becomes a candidate node type.

  2. Predicate clustering. For each candidate type, collect every predicate used by its instances (excluding metadata predicates like rdf:type, rdfs:label, rdfs:comment). When no explicit rdf:type exists, subjects are clustered by the Jaccard similarity of their predicate sets: subjects whose predicate sets have a similarity of at least 0.8 are grouped into one candidate.

  3. Type inference. For each predicate, examine all values across instances:

     • XSD datatype annotations are respected first (xsd:integer maps to int, xsd:dateTime to datetime, etc.).
     • When no annotation exists, heuristics apply: all-integer values → int, all-float → float, all-boolean → bool, all-IRI → reference.
     • Values that consistently point to instances of another candidate type are detected as reference fields.

  4. Cardinality inference. Count values per subject per predicate:

     • If every instance has at least one value → required=True.
     • If any instance has more than one value → list[T].
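The passes above can be illustrated without a store. The following is a minimal pure-Python sketch, not the trails.onto_infer implementation: the toy triples, the function names, and the greedy clustering strategy are all assumptions made for illustration, and the IRI and reference heuristics are left out.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, value) triples without rdf:type.
TRIPLES = [
    ("e1", "name", "Alice"), ("e1", "salary", "95000"),
    ("e1", "skill", "sql"), ("e1", "skill", "python"),
    ("e2", "name", "Bob"), ("e2", "salary", "82000"), ("e2", "skill", "go"),
    ("p1", "title", "Atlas"), ("p1", "budget", "1.5"),
]


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_by_predicates(triples, threshold=0.8):
    """Greedily group subjects whose predicate sets are similar enough."""
    predicates = defaultdict(set)
    for subject, predicate, _ in triples:
        predicates[subject].add(predicate)
    clusters = []  # each entry: (seed predicate set, member subjects)
    for subject, pset in predicates.items():
        for seed, members in clusters:
            if jaccard(seed, pset) >= threshold:
                members.append(subject)
                break
        else:
            clusters.append((pset, [subject]))
    return [members for _, members in clusters]


def infer_python_type(values):
    """Pick the narrowest type that every un-annotated literal parses as."""
    def all_parse(parser):
        try:
            for value in values:
                parser(value)
        except ValueError:
            return False
        return True

    if not values:
        return "str"
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all(value in ("true", "false") for value in values):
        return "bool"
    return "str"


def infer_cardinality(triples, subjects, predicate):
    """Return (required, is_list) for one predicate over a set of subjects."""
    counts = {subject: 0 for subject in subjects}
    for subject, pred, _ in triples:
        if pred == predicate and subject in counts:
            counts[subject] += 1
    required = all(n >= 1 for n in counts.values())
    is_list = any(n > 1 for n in counts.values())
    return required, is_list


groups = cluster_by_predicates(TRIPLES)
print(groups)
print(infer_python_type(["95000", "82000"]),
      infer_cardinality(TRIPLES, groups[0], "skill"))
```

In this sketch the two employee-like subjects share an identical predicate set and land in one group, while the project-like subject forms its own. The real engine works over SPARQL results rather than Python lists.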

CLI usage

# Infer from the default project store (prints to stdout)
trails onto infer

# Write output to a Python module
trails onto infer -o models/inferred.py

# Require at least 5 instances per candidate type
trails onto infer --min-instances 5

# Set minimum confidence threshold
trails onto infer --min-confidence 0.8

Python API

from trails.onto_infer import infer_schema, generate_code

# Infer schema from a store
schema = infer_schema(
    ctx.kg._store,
    trace_id="my-analysis",
    min_instances=3,
    confidence=0.8,
)

print(f"Analyzed {schema.total_triples_analyzed} triples")
print(f"Found {len(schema.candidates)} candidate types")

for candidate in schema.candidates:
    print(f"\n{candidate.name} ({candidate.instance_count} instances, "
          f"confidence: {candidate.confidence:.2f})")
    for prop in candidate.properties:
        marker = "required" if prop.required else "optional"
        print(f"  {prop.name}: {prop.python_type} ({marker}, "
              f"confidence: {prop.confidence:.2f})")

# Generate Python code from the schema
code = generate_code(schema)
print(code)

Reading the output

The generated code includes confidence annotations so you know which declarations to trust and which to review:

"""Auto-generated @node_type declarations.

Source: trails onto infer (deterministic)
"""
from __future__ import annotations

import datetime

from trails.orm import node_type

# 47 instance(s), confidence: 0.94
# source rdf:type: https://myapp.example/Employee
@node_type("Employee", fields={
    "name": str,  # confidence: 1.00 | e.g. 'Alice', 'Bob', 'Carol'
    "department": str,  # confidence: 1.00 | e.g. 'Engineering', 'Marketing'
    "salary": int,  # confidence: 0.96 | e.g. '95000', '82000'
    "manager": str,  # TODO: verify | confidence: 0.72 | e.g. 'https://...'
})
class Employee:
    pass

Key markers to look for:

| Marker | Meaning |
|--------|---------|
| confidence: 1.00 | Every instance has this field; high trust |
| confidence: 0.72 | Only 72% of instances have this field; review it |
| TODO: verify | Confidence below 0.8; the field may be noise |
| source rdf:type: <iri> | The RDF type IRI the candidate was derived from |
| Instance count | More instances = more reliable inference |
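To make the markers concrete, here is a small sketch of how a per-field comment could be derived. It assumes that field confidence is simply the share of instances carrying the field, as the table describes, and that 0.8 is the review threshold. The helper names are hypothetical, and the real generator may weigh other signals.

```python
def field_confidence(instances_with_field, total_instances):
    """Share of instances that carry the field (assumed definition)."""
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    return instances_with_field / total_instances


def annotate(name, python_type, confidence, threshold=0.8):
    """Render one fields={...} entry with the confidence comment."""
    todo = "TODO: verify | " if confidence < threshold else ""
    return f'    "{name}": {python_type},  # {todo}confidence: {confidence:.2f}'


# 34 of 47 employees have a manager, so the field falls below the threshold.
print(annotate("manager", "str", field_confidence(34, 47)))
print(annotate("name", "str", field_confidence(47, 47)))
```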

JSON output

The InferredSchema can be serialized for programmatic consumption:

schema = infer_schema(ctx.kg._store)
print(schema.to_json())

Output:

{
  "candidates": [
    {
      "name": "Employee",
      "source_type_iri": "https://myapp.example/Employee",
      "properties": [
        {
          "name": "name",
          "predicate_iri": "https://myapp.example/name",
          "python_type": "str",
          "required": true,
          "is_list": false,
          "is_ref": false,
          "ref_target": null,
          "confidence": 1.0,
          "sample_values": ["Alice", "Bob", "Carol"]
        }
      ],
      "instance_count": 47,
      "confidence": 0.94
    }
  ],
  "total_triples_analyzed": 1234,
  "total_subjects_analyzed": 89
}
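One way to consume this JSON programmatically is to list the fields that need review. The sketch below uses only the standard library and a trimmed copy of the structure shown above. The "manager" entry and the fields_to_review helper are assumptions added for illustration.

```python
import json

# Trimmed, hypothetical payload following the structure shown above.
SCHEMA_JSON = """
{
  "candidates": [
    {
      "name": "Employee",
      "properties": [
        {"name": "name", "python_type": "str", "confidence": 1.0},
        {"name": "manager", "python_type": "str", "confidence": 0.72}
      ],
      "instance_count": 47,
      "confidence": 0.94
    }
  ]
}
"""


def fields_to_review(schema_json, threshold=0.8):
    """Return (type, field, confidence) for every property below the threshold."""
    schema = json.loads(schema_json)
    flagged = []
    for candidate in schema["candidates"]:
        for prop in candidate["properties"]:
            if prop["confidence"] < threshold:
                flagged.append((candidate["name"], prop["name"], prop["confidence"]))
    return flagged


print(fields_to_review(SCHEMA_JSON))
```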

trails onto generate — LLM-assisted generation (Phase 2)

Describe your domain in plain text and the framework produces @node_type code via an LLM. The output is indistinguishable from hand-written code — no magic, no hidden state, no runtime dependency on the generator.

CLI usage

# From a file
trails onto generate --from domain.txt

# Specify model tier
trails onto generate --model haiku   # cheap, structural scaffolding (~$0.001)
trails onto generate --model sonnet  # domain-aware nuance (~$0.01)

# Dry run: show estimated cost without calling the LLM
trails onto generate --from domain.txt --dry-run

# Write output to a Python module
trails onto generate --from domain.txt -o models/generated.py

# Iterative refinement on existing code
trails onto generate --from domain.txt --refine models/v1.py

Python API

from trails.onto_generate import generate_schema, refine_schema, estimate_cost

# Estimate cost before spending
est = estimate_cost(
    "Clinical trial management with patients, studies, sites, and adverse events.",
    model="haiku",
)
print(f"Model: {est.model}, ~{est.prompt_tokens} prompt tokens, "
      f"estimated cost: ${est.estimated_cost_usd:.6f}")

# Generate (requires an LLM client configured in trails.toml or passed explicitly)
result = generate_schema(
    "Clinical trial management with patients, studies, sites, "
    "and adverse events. Each study has multiple sites. Patients "
    "enroll at a site. Adverse events link to patients and studies.",
    model="haiku",
    max_types=10,
)

print(f"Generated {len(result.types)} types: {result.types}")
print(f"Cost: ${result.cost_usd:.6f} ({result.model_used})")
print()
print(result.code)

Model selection

| Model | Use case | Approximate cost per call |
|-------|----------|---------------------------|
| haiku (default) | Structural scaffolding, simple domains | ~$0.001 |
| sonnet | Domain-aware naming, nuanced relationships | ~$0.01 |
| opus | Complex multi-domain ontologies | ~$0.10 |

The default is haiku — cheapest adequate model per the framework's cost-awareness principle. Use sonnet when domain naming quality matters (e.g. medical, legal).
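For rough budgeting across several generate or refine calls, the approximate per-call figures from the table can be multiplied out. This is a back-of-the-envelope sketch with hypothetical helper names. Actual cost depends on prompt and response size, which estimate_cost reports for a specific description.

```python
# Approximate per-call costs copied from the table above (USD).
APPROX_COST_PER_CALL_USD = {"haiku": 0.001, "sonnet": 0.01, "opus": 0.10}


def estimate_budget(model, calls):
    """Rough total cost for a number of calls with one model tier."""
    return APPROX_COST_PER_CALL_USD[model] * calls


def models_within(budget_usd, calls):
    """Model tiers whose rough total stays within the budget."""
    return [model for model in APPROX_COST_PER_CALL_USD
            if estimate_budget(model, calls) <= budget_usd]


# Twenty refinement rounds with a 0.50 USD budget.
print(models_within(0.50, 20))
```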

Iterative refinement

Refine existing generated code with natural-language feedback:

from trails.onto_generate import refine_schema

refined = refine_schema(
    code=result.code,
    feedback="Add a Randomization type linking Patient to StudyArm. "
             "Make patient.email optional.",
    model="haiku",
)
print(refined.code)

Output format

Every generated file carries a header identifying its source:

"""Auto-generated @node_type declarations.

Source: trails onto generate (LLM: claude-haiku-4-5, 2026-04-17)
Domain: Clinical trial management with patients, studies, sites, and adverse...
"""
from __future__ import annotations

import datetime

from trails.orm import node_type

@node_type("Site", fields={
    "name": str,
    "address": str,
    "principal_investigator": str,
})
class Site:
    """A clinical trial site where patients are enrolled."""
    pass

# ... more types ...

The code is plain Python. Edit it, commit it, import it — no runtime dependency on the generator.

trails onto refine — usage-driven refinement (Phase 3)

The SchemaAnalyzer observes runtime patterns — field writes, query access, SHACL validation failures, null rates — and suggests schema improvements. Deterministic analysis, zero LLM cost.

CLI usage

# Analyse usage patterns and print suggestions
trails onto refine

# Include onto-refine section in the doctor report
trails doctor

# Generate migration code for accepted suggestions
trails onto refine --apply

Architecture: UsageCollector + SchemaAnalyzer

Two components work together:

  1. UsageCollector — a lightweight observer hook that accumulates statistics from kg_write, kg_query, and shacl_violation events. Does O(1) work per event. Thread-safe.

  2. SchemaAnalyzer — runs a single pass over the accumulated counters and produces a RefineReport with actionable Suggestion objects.

from trails.onto_refine import UsageCollector, SchemaAnalyzer

# Start collecting usage data (register as observability hook)
collector = UsageCollector()
collector.start()

# ... your app runs, writes data, queries data ...

# Later: analyse and get suggestions
analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(
    min_samples=10,        # skip types with fewer than 10 writes
    null_threshold=0.90,   # suggest optional if >90% null
    violation_threshold=0.10,  # flag constraints violated >10%
    co_query_threshold=0.80,   # suggest index if co-queried >80%
)

for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason} "
          f"(confidence: {s.confidence:.2f}, samples: {s.sample_count})")

# Stop collecting when done
collector.stop()
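The split between a cheap collector and a single-pass analyzer can be sketched in plain Python. MiniCollector and suggest_optional below are hypothetical stand-ins, not the trails classes. They assume counters guarded by a lock and a null-rate rule matching the null_threshold and min_samples parameters above.

```python
import threading
from collections import Counter


class MiniCollector:
    """Hypothetical stand-in for illustration; not the trails UsageCollector."""

    def __init__(self):
        self._lock = threading.Lock()
        self.writes = Counter()  # node_type -> number of writes
        self.nulls = Counter()   # (node_type, field) -> number of null writes

    def record_write(self, node_type, fields, null_fields=()):
        # Only counter increments happen here, so each event stays cheap.
        with self._lock:
            self.writes[node_type] += 1
            for field in null_fields:
                self.nulls[(node_type, field)] += 1


def suggest_optional(collector, min_samples=10, null_threshold=0.90):
    """Single pass over the counters; flags fields that are mostly null."""
    suggestions = []
    for (node_type, field), null_count in collector.nulls.items():
        total = collector.writes[node_type]
        if total < min_samples:
            continue  # too few writes to say anything useful
        rate = null_count / total
        if rate > null_threshold:
            suggestions.append(("make_optional", node_type, field, round(rate, 2)))
    return suggestions


collector = MiniCollector()
for i in range(20):
    # 19 of the 20 writes leave "nickname" null.
    collector.record_write("Employee", ["name", "nickname"],
                           null_fields=["nickname"] if i else [])

print(suggest_optional(collector))
```

Keeping the analysis out of the event path is what lets collection stay lightweight. The thresholds only matter when a report is requested.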

Suggestion types

| Kind | Trigger | Example |
|------|---------|---------|
| make_optional | Field null in >90% of writes | "Field 'nickname' is null in 95% of 200 writes" |
| remove_field | Field written but never queried | "Field 'internal_code' written 150 times but never queried" |
| relax_constraint | SHACL constraint violated in >10% of writes | "Constraint 'maxLength' on 'name' violated in 12% of writes" |
| add_index | Fields queried together in >80% of cases | "Fields [name, department] co-queried in 85% of cases" |
| add_relationship | Node types always co-occur in queries | "Types [Patient, Study] co-occur in 50 queries" |
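Two of these triggers are simple set and ratio computations. The sketch below is an illustration under stated assumptions: the co-query rate is taken as the share of all queries against one node type that mention both fields, and the function names are hypothetical. The analyzer's exact denominators are not specified here.

```python
from collections import Counter
from itertools import combinations


def co_query_pairs(queries, threshold=0.80):
    """Field pairs that appear together in more than `threshold` of the queries.

    `queries` holds one list of field names per query against a single node type.
    """
    total = len(queries)
    if total == 0:
        return {}
    pair_counts = Counter()
    for fields in queries:
        for pair in combinations(sorted(set(fields)), 2):
            pair_counts[pair] += 1
    return {pair: count / total
            for pair, count in pair_counts.items()
            if count / total > threshold}


def never_queried(written_fields, queried_fields):
    """Fields that were written at least once but never appear in a query."""
    return sorted(set(written_fields) - set(queried_fields))


# 17 of 20 queries ask for name and department together.
queries = [["name", "department"]] * 17 + [["name"]] * 3
print(co_query_pairs(queries))
print(never_queried(["name", "internal_code"], ["name", "department"]))
```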

Manual event recording (testing)

For testing or external data, record events manually without the observability hook:

collector = UsageCollector()

# Simulate writes
collector.record_write("Employee", fields=["name", "salary"])
collector.record_write("Employee", fields=["name"], null_fields=["salary"])

# Simulate queries
collector.record_query("Employee", fields=["name", "department"])

# Simulate SHACL violations
collector.record_shacl_violation("Employee", "salary", "minInclusive")

# Analyse
analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze(min_samples=1)

Migration code generation

Generate reviewable Python migration snippets from accepted suggestions:

from trails.onto_refine import generate_migration_code

code = generate_migration_code(report.suggestions)
print(code)

Output:

"""Migration code generated by `trails onto refine --apply`.

Review each change before applying.
"""
from __future__ import annotations

# --- Suggestion 1: make_optional ---
# Node type: Employee
# Field: nickname
# Reason: Field 'nickname' is null in 95% of 200 writes — consider making it optional.
# Confidence: 0.95

# In your @node_type/Employee definition:
# Change: "nickname": str
# To:     "nickname": str | None
# Or in @shape: predicate(..., required=False)

Refinement is progressive and non-blocking. Suggestions accumulate and are surfaced on demand. No schema change is applied without explicit human approval.
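The comment header in the migration output can be approximated with a small renderer. This sketch mirrors the Suggestion attributes listed in the reference table, but SuggestionSketch and render_header are hypothetical names, and the real generator also emits the kind-specific change hints shown above.

```python
from dataclasses import dataclass


@dataclass
class SuggestionSketch:
    """Hypothetical mirror of the Suggestion attributes, for illustration."""
    kind: str
    node_type: str
    field: str
    reason: str
    confidence: float
    sample_count: int


def render_header(index, suggestion):
    """Render the reviewable comment header for one suggestion."""
    return "\n".join([
        f"# --- Suggestion {index}: {suggestion.kind} ---",
        f"# Node type: {suggestion.node_type}",
        f"# Field: {suggestion.field}",
        f"# Reason: {suggestion.reason}",
        f"# Confidence: {suggestion.confidence:.2f}",
    ])


example = SuggestionSketch(
    kind="make_optional",
    node_type="Employee",
    field="nickname",
    reason="Field 'nickname' is null in 95% of 200 writes",
    confidence=0.95,
    sample_count=200,
)
print(render_header(1, example))
```

Because the output is only comments, nothing changes in the schema until a reviewer edits the @node_type definition by hand.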

Admin dashboard for ontology evolution (Phase 4)

The trails-admin UI includes an ontology dashboard at /admin/ontology that ties together all three mechanisms (infer, generate, refine) in a single browser interface.

Routes

| Method | Path | Description |
|--------|------|-------------|
| GET | /admin/ontology | Overview: current @node_type definitions, last inference, active suggestions, generation link |
| POST | /admin/ontology/infer | Trigger a deterministic inference run against the kernel store |
| GET | /admin/ontology/generate | Form for LLM-assisted generation with model selection and cost estimation |
| POST | /admin/ontology/generate | Submit a generation request (estimate or generate) |
| GET | /admin/ontology/suggestions | List refinement suggestions with confidence bars |

Overview page

The overview page shows four sections:

  1. Current @node_type definitions — lists all registered types with their IRI, fields, and inheritance.
  2. Schema Inference — displays the last inference run's results (triple count, candidates, confidence). A "Run inference" button triggers a new run.
  3. Refinement Suggestions — shows the count of active suggestions from the UsageCollector with a link to the full list.
  4. LLM-Assisted Generation — links to the generation form.

Generation form

The generation form accepts a domain description and model selection (Haiku or Sonnet). Two actions:

  • Estimate cost — shows prompt token count and estimated USD without calling the LLM.
  • Generate — calls the LLM and displays the generated @node_type code inline.

Suggestions page

Displays all refinement suggestions in a table with columns for kind, node type, field, confidence (rendered as a colored bar), reason, and sample count. This is the same data as trails onto refine but rendered in the browser.

Mounting the dashboard

from fastapi import FastAPI
from trails_admin.routes.ontology import router

app = FastAPI()
app.include_router(router)
# Dashboard available at /admin/ontology

The progressive story

Here is the complete workflow from raw data to a mature ontology:

Step 1: Load CSV data

from trails.rml import run_mapping

result = run_mapping(ctx, "mappings/employees.ttl")
print(f"Loaded {result.triples_added} triples")

Step 2: Infer schema

trails onto infer -o models/inferred.py

The output contains candidate @node_type declarations with confidence scores.

Step 3: Review and edit

Open models/inferred.py, review the confidence markers, fix any TODO: verify fields, remove noise, and adjust types:

# Before (generated):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": str,  # TODO: verify | confidence: 0.72
})
class Employee:
    pass

# After (reviewed):
@node_type("Employee", fields={
    "name": str,
    "department": str,
    "salary": int,
    "manager": "Employee | None",  # optional self-reference
})
class Employee: ...

Step 4: Use the types

from models.inferred import Employee

@capability
def top_earners(ctx, min_salary: int) -> list:
    return [
        {"name": e.name, "salary": e.salary}
        for e in Employee.where(salary__gte=min_salary).fetch(ctx)
    ]

Step 5: Refine over time (Phase 3)

As the app runs, the UsageCollector records write, query, and validation events, and trails onto refine turns them into suggestions: making rarely populated fields optional, relaxing constraints that are violated often, and flagging fields that are written but never queried:

from trails.onto_refine import UsageCollector, SchemaAnalyzer

collector = UsageCollector()
collector.start()

# ... app runs for a while ...

analyzer = SchemaAnalyzer(collector)
report = analyzer.analyze()
for s in report.suggestions:
    print(f"[{s.kind}] {s.node_type}.{s.field}: {s.reason}")

Step 6: Monitor in the dashboard (Phase 4)

Open /admin/ontology to see current types, run inference, review suggestions, and generate new types from the browser — all without touching the CLI.

The ontology is a living document, not a static contract.

Reference

| Symbol | Description |
|--------|-------------|
| infer_schema(store, trace_id, *, min_instances, confidence) | Phase 1: infer schema from store data; returns InferredSchema |
| generate_code(schema) | Phase 1: generate Python @node_type code from an InferredSchema |
| InferredSchema | .candidates, .total_triples_analyzed, .total_subjects_analyzed, .to_json() |
| NodeTypeCandidate | .name, .source_type_iri, .properties, .instance_count, .confidence |
| PropertyCandidate | .name, .predicate_iri, .python_type, .required, .is_list, .is_ref, .ref_target, .confidence, .sample_values |
| generate_schema(description, *, model, llm_client, max_types) | Phase 2: LLM-assisted generation; returns GeneratedSchema |
| refine_schema(code, feedback, *, model, llm_client) | Phase 2: iterative LLM refinement of existing code |
| estimate_cost(description, *, model) | Phase 2: pre-execution cost estimate; returns CostEstimate |
| GeneratedSchema | .code, .types, .cost_usd, .model_used, .raw_response |
| CostEstimate | .prompt_tokens, .estimated_cost_usd, .model |
| UsageCollector(max_events) | Phase 3: accumulates usage stats from observability events |
| UsageCollector.start() / .stop() | Register/unregister the observer hook |
| UsageCollector.record_write(node_type, fields, null_fields) | Manual write event recording |
| UsageCollector.record_query(node_type, fields) | Manual query event recording |
| UsageCollector.record_shacl_violation(node_type, field, constraint) | Manual violation recording |
| SchemaAnalyzer(collector) | Phase 3: analyses usage stats and produces suggestions |
| SchemaAnalyzer.analyze(min_samples, null_threshold, ...) | Run analysis; returns RefineReport |
| RefineReport | .suggestions, .stats, .analysis_period, .to_json() |
| Suggestion | .kind, .node_type, .field, .reason, .confidence, .sample_count |
| generate_migration_code(suggestions) | Phase 3: generate reviewable Python migration snippets |
| /admin/ontology routes | Phase 4: admin dashboard for ontology evolution (FastAPI router) |

See also