ADR-0025: Auto-Ontology Generation - Data-First Schema Discovery
- Status: Accepted (2026-04-19)
- Date: 2026-04-17
- Depends on: ADR-0012 (Cost as framework primitive), ADR-0017 (ActiveGraph ORM), ADR-0021 (Progressive enhancement)
- Supersedes: —
- Superseded by: —
Context
Traditional knowledge-graph development is schema-first: design the ontology, build the store, then validate incoming data against the schema. This creates a chicken-and-egg problem — you need domain expertise before you have data, but it is the data that reveals the domain. In practice, teams either (a) spend weeks in ontology workshops before writing any code, or (b) skip the ontology entirely and end up with an untyped property soup that becomes unmaintainable.
Trails' progressive enhancement model (ADR-0021) already dissolves the
hard boundary: an app can start with bare labels, add @node_type
declarations when patterns stabilise, layer on @shape for SHACL-grade
validation, and promote to OWL when interop or reasoning is needed.
Each step is additive — no mode switches, no migrations.
The natural next step is to close the loop: the framework observes the data that flows through it and suggests the ontology. Instead of "design first, then populate," the workflow becomes "populate first, then formalise." The framework is the ontology engineer.
Two developments make this practical now:
- Statistical inference over the KG store is straightforward. Entities that share the same set of predicates are candidate node types. Value distributions reveal datatypes. Cardinality is observable. Co-occurrence reveals relationships. All of this is deterministic SPARQL analysis: no LLM required, no cost.
- LLM-assisted generation can bootstrap an ontology from a single natural-language description of the domain. The trails.llm client (M9) already provides cost-tracked, provider-agnostic completions. Feeding a domain description to a cheap model and parsing the output into @node_type + @shape code is a tractable code-generation problem.
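To make the "value distributions reveal datatypes" signal concrete, the sketch below guesses a datatype for one predicate from the lexical values observed on it. It works on plain Python strings rather than SPARQL results. The name guess_datatype and the xsd:string fallback are assumptions made for illustration, not part of Trails.

```python
from datetime import datetime


def _is_integer(value: str) -> bool:
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_datetime(value: str) -> bool:
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def guess_datatype(values: list[str]) -> str:
    """Return a datatype guess for the values seen on one predicate."""
    if not values:
        return "unknown"
    if all(_is_integer(v) for v in values):
        return "xsd:integer"
    if all(_is_datetime(v) for v in values):
        return "xsd:dateTime"
    if all(v.startswith(("http://", "https://")) for v in values):
        return "object property"
    return "xsd:string"  # assumed fallback, not specified by this ADR
```

A single non-conforming value changes the guess, which is one reason the inferred output carries confidence annotations rather than hard claims.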
Decision
Trails provides three complementary mechanisms for data-first schema
discovery. All three produce regular @node_type / @shape Python
code — no magic, no hidden state.
1. Schema Inference: trails onto infer
Analyses an existing KG store and infers candidate type declarations.
Inputs: a populated Oxigraph store (local file or SPARQL endpoint).
Process (deterministic, zero LLM cost):
| Signal | Inference |
|---|---|
| Entities sharing the same predicate set | Candidate @node_type label |
| Value distributions (all integers, all URIs, all dates) | Property datatype (xsd:integer, object property, xsd:dateTime) |
| Predicate cardinality across instances (always exactly 1, 0–N) | required=True vs list[T] |
| Shared object IRIs across entity groups | Candidate relationship edges |
| Predicate co-occurrence clusters | Candidate shape groupings |
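The predicate-set and cardinality rows of the table can be sketched over an in-memory list of triples. This is a simplified stand-in for the SPARQL analysis: it groups subjects by their exact predicate set, which real data would rarely satisfy so cleanly. The function names and the ex: sample data are hypothetical.

```python
from collections import Counter, defaultdict

Triple = tuple[str, str, str]


def group_by_predicate_set(triples: list[Triple]) -> dict[frozenset[str], set[str]]:
    """Map each distinct predicate set to the subjects carrying exactly that set."""
    predicates_of: dict[str, set[str]] = defaultdict(set)
    for subject, predicate, _ in triples:
        predicates_of[subject].add(predicate)
    groups: dict[frozenset[str], set[str]] = defaultdict(set)
    for subject, predicates in predicates_of.items():
        groups[frozenset(predicates)].add(subject)
    return dict(groups)


def cardinality(triples: list[Triple], subjects: set[str], predicate: str) -> str:
    """Classify a predicate as single-valued or multi-valued within one group."""
    counts = Counter(s for s, p, _ in triples if p == predicate and s in subjects)
    if all(counts[s] == 1 for s in subjects):
        return "required=True"
    return "list[T]"


# Hypothetical sample data: two patient-like entities and one study-like entity.
triples = [
    ("ex:p1", "ex:name", "Ada"),
    ("ex:p1", "ex:enrolledIn", "ex:s1"),
    ("ex:p2", "ex:name", "Grace"),
    ("ex:p2", "ex:enrolledIn", "ex:s1"),
    ("ex:p2", "ex:enrolledIn", "ex:s2"),
    ("ex:s1", "ex:title", "Study One"),
]

groups = group_by_predicate_set(triples)
patients = groups[frozenset({"ex:name", "ex:enrolledIn"})]
```

Here ex:name appears exactly once per subject in the group, so it maps to required=True, while ex:enrolledIn appears twice on ex:p2 and maps to list[T].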
Output: a Python module containing @node_type-decorated classes
with typed fields, plus optional @shape decorators when the data
justifies constraints (e.g. observed min_value / max_value ranges,
enumeration sets). The output includes confidence annotations as
comments (e.g. # confidence: 0.93 — 14/15 instances have this field).
CLI surface:
# Infer from the default project store
trails onto infer
# Infer from a specific store / endpoint
trails onto infer --store /path/to/oxigraph.db
trails onto infer --endpoint http://localhost:7878/query
# Write output to a module instead of stdout
trails onto infer -o models/inferred.py
2. LLM-Assisted Generation: trails onto generate
Natural language in, @node_type code out.
Inputs: a domain description in plain text.
Process (non-deterministic, LLM cost reported upfront):
- The user provides a domain description, e.g. "I'm building a clinical trial management system with patients, studies, sites, and adverse events."
- The CLI estimates token cost based on description length and the configured model, reports it, and asks for confirmation.
- trails.llm sends the description to the cheapest adequate model (Haiku-class for structural scaffolding, Sonnet-class when the description demands domain nuance; selectable via --model).
- The LLM generates candidate @node_type definitions, relationship fields, and constraint annotations.
- The CLI parses the LLM output, validates it against the Trails ORM surface, and writes a clean Python module.
Output: a Python module with @node_type classes, typed fields,
relationship references, and @shape decorators where the LLM
suggested constraints.
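One plausible shape for the "parse and validate" step is a static check with Python's standard ast module, which inspects LLM output without executing it. The sketch below only verifies that every class carries the expected decorator. What "validates it against the Trails ORM surface" involves in full is not specified here, and undecorated_classes is a hypothetical helper.

```python
import ast


def undecorated_classes(source: str, decorator: str = "node_type") -> list[str]:
    """Return names of classes in `source` that lack the expected decorator.

    Raises SyntaxError if the source is not valid Python. The source is
    parsed only, never executed.
    """
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        names = set()
        for dec in node.decorator_list:
            # Handle both @node_type and @node_type(...) forms.
            target = dec.func if isinstance(dec, ast.Call) else dec
            if isinstance(target, ast.Name):
                names.add(target.id)
            elif isinstance(target, ast.Attribute):
                names.add(target.attr)
        if decorator not in names:
            missing.append(node.name)
    return missing


# Hypothetical LLM output containing one stray, undecorated class.
candidate = '''
@node_type
class Patient:
    name: str

class Scratch:
    pass
'''
```

Because the candidate is parsed rather than imported, malformed or unexpected LLM output cannot run code during validation.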
CLI surface:
# Interactive: prompt for domain description
trails onto generate
# One-shot from a file
trails onto generate --from domain.txt
# Specify model tier
trails onto generate --model haiku # cheap, structural
trails onto generate --model sonnet # domain-aware
# Dry run: show estimated cost without calling the LLM
trails onto generate --from domain.txt --dry-run
# Iterative refinement: feed back a previous output
trails onto generate --from domain.txt --refine models/v1.py
Cost envelope: every trails onto generate call opens a
CostTracker envelope tagged onto:generate per ADR-0012. The
estimated and actual token costs are printed at the end. No call is
made without user confirmation unless --yes is passed.
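The gating rules above (estimate first, never call on a dry run, require confirmation unless --yes is passed) can be sketched as follows. The characters-per-token ratio and the price argument are placeholder assumptions, not the CostTracker API from ADR-0012, and the estimate covers input tokens only.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    input_tokens: int
    usd: float


def estimate_cost(description: str, usd_per_million_tokens: float,
                  chars_per_token: float = 4.0) -> CostEstimate:
    """Rough estimate from description length; both rates are placeholders."""
    tokens = max(1, round(len(description) / chars_per_token))
    return CostEstimate(tokens, tokens * usd_per_million_tokens / 1_000_000)


def should_call_llm(dry_run: bool, assume_yes: bool, user_confirmed: bool) -> bool:
    """A dry run never calls the LLM; otherwise --yes or explicit consent is needed."""
    if dry_run:
        return False
    return assume_yes or user_confirmed
```

Keeping the decision in one pure function makes the "no call without confirmation" rule easy to unit-test independently of any provider client.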
3. Usage-Driven Refinement: trails onto refine
Observes runtime patterns and suggests schema improvements.
Inputs: query logs, SHACL validation logs, and write-path statistics collected by the framework's observability hooks.
Signals and suggestions:
| Observation | Suggestion |
|---|---|
| Field X is null in >90% of instances | "Consider making X optional (min_count=0)" |
| Field Y violates its SHACL constraint in >10% of writes | "Constraint may be too strict — review Y's @shape bounds" |
| Node types A, B, C always appear together in queries | "Consider a composite type or a relationship grouping" |
| Field Z is queried in >80% of read paths but has no type declaration | "Consider promoting Z to a typed field for validation" |
| New predicates appear that match no declared type | "Undeclared predicate foo:bar seen on 47 instances — candidate for typing" |
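The first two rows of the table reduce to simple rate checks over collected statistics. The sketch below assumes a hypothetical FieldStats record; how the observability hooks actually expose these counters is not defined in this ADR.

```python
from dataclasses import dataclass


@dataclass
class FieldStats:
    name: str
    instances: int    # instances of the owning type
    null_count: int   # instances where the field is absent
    writes: int       # writes touching the field
    violations: int   # writes that failed the field's SHACL constraint


def suggestions(stats: FieldStats) -> list[str]:
    """Apply the null-rate (>90%) and violation-rate (>10%) rules."""
    out = []
    if stats.instances and stats.null_count / stats.instances > 0.90:
        out.append(f"Consider making {stats.name} optional (min_count=0)")
    if stats.writes and stats.violations / stats.writes > 0.10:
        out.append(f"Constraint may be too strict: review {stats.name}'s @shape bounds")
    return out
```

Both thresholds are strict inequalities, so a field that is null in exactly 90% of instances produces no suggestion.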
Output: a diagnostic report surfaced via trails doctor, with
actionable suggestions and the code diff that would implement each one.
CLI surface:
# Analyse usage patterns and print suggestions
trails onto refine
# Show suggestions in trails doctor output
trails doctor # includes onto-refine section when data is available
# Apply a specific suggestion (interactive confirmation)
trails onto refine --apply suggestion-id
Progressive, not blocking: refinement runs as a background analysis. It never blocks writes or queries. Suggestions accumulate and are surfaced on demand.
Progressive enhancement integration
The three mechanisms map cleanly onto the progressive enhancement levels from ADR-0021:
| Level | Mechanism | How it works |
|---|---|---|
| 0 | Manual @node_type | Status quo: hand-written type declarations |
| 1 | trails onto infer | Derive types from existing data |
| 2 | trails onto generate | Bootstrap types from natural language |
| 3 | trails onto refine | Evolve types from usage patterns |
| 4 | Continuous evolution | The framework suggests schema changes as data evolves (combines 1–3) |
Each level is additive. A project may use any combination. Generated
code is indistinguishable from hand-written code — the output IS the
@node_type / @shape surface, not an abstraction on top of it.
Design principles
- Never auto-apply. All three mechanisms suggest; the user confirms. No schema change is applied without explicit approval. trails onto refine --apply requires interactive confirmation (or --yes for scripted pipelines).
- Output is plain code. Generated @node_type and @shape definitions are regular Python modules. No proprietary format, no hidden state, no runtime dependency on the generator. Delete the generator tomorrow; the output still works.
- Deterministic vs non-deterministic, clearly labelled. infer is deterministic (statistical analysis of the store). generate is non-deterministic (LLM completion). The CLI labels each output: # Source: trails onto infer (deterministic) vs # Source: trails onto generate (LLM: claude-3-haiku, 2026-04-17).
- Cost-aware. infer and refine are free (local SPARQL analysis). generate costs LLM tokens; the cost is estimated and reported before any API call, per ADR-0012. --dry-run shows the estimate without spending.
- Reversible. If the ontology changes after adoption, trails onto evolve (the existing OntologyEvolution machinery) handles the migration. The generated code participates in the same diff/migrate workflow as hand-written code.
- Composable. infer → generate → refine is a natural pipeline but not a required one. Each tool works standalone. A project may generate from a description, never infer. Another may infer from a legacy triple dump, never generate. A third may hand-write everything and only use refine for hygiene.
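The source labels described under the labelling principle could be produced by a small helper such as the one below. The two header formats are taken from this ADR; the function name and its parameters are hypothetical.

```python
from datetime import date
from typing import Optional


def source_header(mechanism: str, model: Optional[str] = None,
                  generated_on: Optional[date] = None) -> str:
    """Build the provenance comment written at the top of a generated module."""
    if mechanism == "infer":
        return "# Source: trails onto infer (deterministic)"
    if mechanism == "generate":
        if model is None or generated_on is None:
            raise ValueError("generate headers need a model name and a date")
        return (f"# Source: trails onto generate "
                f"(LLM: {model}, {generated_on.isoformat()})")
    raise ValueError(f"no header format defined for {mechanism!r}")
```

Recording the model and date in the header gives reviewers enough context to treat LLM output with more scrutiny than deterministic output.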
Non-goals
- Automatic OWL reasoning from inferred schemas. Promoting inferred types to OWL classes with inference rules is M6 reasoner territory (ADR-0004). This ADR stops at @node_type + @shape.
- Unsupervised schema changes in production. The framework suggests; a human (or an explicitly approved agent) applies. No silent migrations.
- Replacing domain experts. The goal is augmentation, not replacement. A generated ontology is a starting point: faster than a blank page, not a substitute for domain review.
- Cross-store federation. infer operates on a single store. Federated inference across multiple SPARQL endpoints is out of scope.
Consequences
Positive
- Eliminates the blank-page problem. New projects get a working ontology in minutes instead of weeks.
- Lowers the barrier for non-semweb users. "Describe your domain" is more accessible than "write SHACL shapes."
- Reinforces ADR-0021's progressive enhancement: the framework actively helps users move from labels to types to shapes.
- Creates a tight feedback loop: write data → infer types → validate with types → refine from validation → evolve. The ontology is a living document, not a static contract.
Negative
- Inferred schemas may be wrong. Garbage data yields garbage types. The confidence annotations and mandatory human review mitigate this, but users may over-trust generated output.
- LLM-generated code introduces non-determinism into the development workflow. Two runs of trails onto generate with the same input may produce different outputs. Pinning model + temperature + seed mitigates but does not eliminate this.
- Additional CLI surface area to document and maintain.
Neutral
- Storage model unchanged. Generated @node_type / @shape code uses the same Oxigraph store, the same IRI minting, the same validation pipeline as hand-written code.
- No new runtime dependencies. infer and refine use SPARQL against the existing store. generate uses trails.llm, which is already a framework module.
Relationship to other ADRs
| ADR | Relationship |
|---|---|
| ADR-0012 (Cost as framework primitive) | generate opens a cost envelope per call; --dry-run reports estimated cost. |
| ADR-0017 (ActiveGraph ORM) | @node_type is the output format for all three mechanisms. |
| ADR-0021 (Progressive enhancement) | Foundation. Auto-ontology is the next additive layer — the framework helping users progress through the enhancement levels. |
| ADR-0022 (Cedar unified matcher) | Unchanged. Generated types participate in strongest-available-type matching like any hand-written type. |
| ADR-0004 (Query-time reasoning) | Out of scope. OWL promotion from inferred types is deferred to M6. |
| ADR-0009 (Provenance always on) | generate emits a prov:Activity of type trails:OntologyGeneration when a context is available. |
Alternatives considered
- Schema-first only (status quo). Rejected. The blank-page problem is the single biggest onboarding friction point. Telling users "just write your ontology" is the semweb equivalent of "just write your SQL migrations": technically correct, practically hostile.
- Fully automatic ontology management. Rejected. Removing the human from the loop invites silent schema drift and breaks the trust model. The framework suggests; the human decides.
- External ontology-learning tool (e.g. Text2Onto, OntoLearn). Rejected. These tools produce OWL/RDFS output that then needs to be translated back into Trails' @node_type / @shape surface. Keeping inference inside the framework means the output IS the framework's native surface: no translation layer, no impedance mismatch.
- LLM-only generation (no statistical inference). Rejected. LLM generation is non-deterministic and costs money. Statistical inference over an existing store is deterministic and free. Both have value; neither subsumes the other.
Open questions
- Should trails onto infer support incremental mode (analyse only new triples since last run), or always re-analyse the full store? Recommendation: full-store for v1; incremental as a performance optimisation when stores exceed ~100k triples.
- Should trails onto generate support multi-turn refinement (chat-style back-and-forth with the LLM), or only single-shot generation with a --refine flag? Recommendation: --refine for v1; multi-turn adds session state complexity that is not justified until user demand is proven.
- What confidence threshold should infer use before including a candidate type in the output? Recommendation: 0.7 default, configurable via --min-confidence. Fields below threshold are still emitted but commented out with their confidence score.
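The recommended threshold behaviour for fields can be illustrated with a small renderer: fields at or above the threshold are emitted as code, and fields below it are emitted commented out with their score. This is a sketch under assumptions. render_field is a hypothetical helper, confidence is taken to be the fraction of instances carrying the field, and the comment layout is a simplified variant of the annotation format shown earlier in this ADR.

```python
def render_field(name: str, type_name: str, present: int, total: int,
                 min_confidence: float = 0.7) -> str:
    """Render one field line; below-threshold fields are commented out."""
    if total <= 0:
        raise ValueError("total must be positive")
    confidence = present / total
    line = (f"{name}: {type_name}  "
            f"# confidence: {confidence:.2f} ({present}/{total} instances)")
    if confidence < min_confidence:
        return "# " + line
    return line
```

Emitting low-confidence fields as comments keeps them visible for review without making them part of the declared type.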