ADR-0025: Auto-Ontology Generation — Data-First Schema Discovery

Status: Accepted (2026-04-19)
Date: 2026-04-17
Depends on: ADR-0012 (Cost as framework primitive), ADR-0017 (ActiveGraph ORM), ADR-0021 (Progressive enhancement)
Supersedes: —
Superseded by: —

Context

Traditional knowledge-graph development is schema-first: design the ontology, build the store, then validate incoming data against the schema. This creates a chicken-and-egg problem — you need domain expertise before you have data, but it is the data that reveals the domain. In practice, teams either (a) spend weeks in ontology workshops before writing any code, or (b) skip the ontology entirely and end up with an untyped property soup that becomes unmaintainable.

Trails' progressive enhancement model (ADR-0021) already dissolves the hard boundary: an app can start with bare labels, add @node_type declarations when patterns stabilise, layer on @shape for SHACL-grade validation, and promote to OWL when interop or reasoning is needed. Each step is additive — no mode switches, no migrations.

The natural next step is to close the loop: the framework observes the data that flows through it and suggests the ontology. Instead of "design first, then populate," the workflow becomes "populate first, then formalise." The framework is the ontology engineer.

Two developments make this practical now:

Statistical inference over the KG store is straightforward. Entities that share the same set of predicates are candidate node types. Value distributions reveal datatypes. Cardinality is observable. Co-occurrence reveals relationships. All of this is deterministic SPARQL analysis: no LLM required, no token cost.
LLM-assisted generation can bootstrap an ontology from a single natural-language description of the domain. The trails.llm client (M9) already provides cost-tracked, provider-agnostic completions. Feeding a domain description to a cheap model and parsing the output into @node_type + @shape code is a tractable code-generation problem.
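
To make the first signal concrete, the following is a minimal sketch of grouping entities by their predicate set. It runs over an in-memory list of triples rather than a real store, and the `ex:` identifiers and the function name `group_by_predicate_set` are illustrative assumptions, not part of Trails.

```python
from collections import defaultdict

# Toy (subject, predicate, object) triples; every identifier here is made up.
TRIPLES = [
    ("ex:p1", "ex:name", "Ada"),
    ("ex:p1", "ex:birthDate", "1990-01-01"),
    ("ex:p2", "ex:name", "Grace"),
    ("ex:p2", "ex:birthDate", "1985-06-30"),
    ("ex:s1", "ex:title", "Trial A"),
    ("ex:s1", "ex:site", "ex:site1"),
]


def group_by_predicate_set(triples):
    """Map each distinct predicate set to the subjects carrying exactly that set."""
    predicates_by_subject = defaultdict(set)
    for subject, predicate, _obj in triples:
        predicates_by_subject[subject].add(predicate)

    groups = defaultdict(list)
    for subject, predicates in predicates_by_subject.items():
        groups[frozenset(predicates)].append(subject)
    return {signature: sorted(subjects) for signature, subjects in groups.items()}


candidates = group_by_predicate_set(TRIPLES)
for signature, subjects in candidates.items():
    print(sorted(signature), "->", subjects)
```

Each resulting group is only a candidate node type. Real data rarely has perfectly uniform predicate sets, so a production analysis would presumably tolerate partial overlap rather than require an exact match.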

Decision

Trails provides three complementary mechanisms for data-first schema discovery. All three produce regular @node_type / @shape Python code — no magic, no hidden state.

1. Schema Inference — `trails onto infer`

Analyses an existing KG store and infers candidate type declarations.

Inputs: a populated Oxigraph store (local file or SPARQL endpoint).

Process (deterministic, zero LLM cost):

| Signal | Inference |
|---|---|
| Entities sharing the same predicate set | Candidate `@node_type` label |
| Value distributions (all integers, all URIs, all dates or timestamps) | Property datatype (`xsd:integer`, object property, `xsd:date` / `xsd:dateTime`) |
| Predicate cardinality across instances (always exactly 1, 0–N) | `required=True` vs `list[T]` |
| Shared object IRIs across entity groups | Candidate relationship edges |
| Predicate co-occurrence clusters | Candidate shape groupings |
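
The datatype and cardinality rows can be illustrated with a small sketch. It assumes values arrive as plain strings and uses simple regular expressions in place of real RDF literal typing. The function names, the sample data, and the `"optional"` label for the 0..1 case are assumptions made for illustration.

```python
import re

INTEGER = re.compile(r"[+-]?\d+")
DATETIME = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.*")


def infer_datatype(values):
    """Pick a datatype only when every observed value agrees; else fall back to string."""
    if not values:
        return None  # nothing observed, nothing to infer
    if all(v.startswith(("http://", "https://")) for v in values):
        return "object property"
    if all(INTEGER.fullmatch(v) for v in values):
        return "xsd:integer"
    if all(DATETIME.fullmatch(v) for v in values):
        return "xsd:dateTime"
    return "xsd:string"


def infer_cardinality(instances, predicate):
    """instances: one dict per entity, mapping predicate -> list of values."""
    counts = [len(inst.get(predicate, [])) for inst in instances]
    if all(c == 1 for c in counts):
        return "required=True"
    if any(c > 1 for c in counts):
        return "list[T]"
    return "optional"  # assumed label for the 0..1 case


def summarise(instances):
    """Return {predicate: (datatype, cardinality)} for a group of instances."""
    predicates = sorted({p for inst in instances for p in inst})
    summary = {}
    for p in predicates:
        values = [v for inst in instances for v in inst.get(p, [])]
        summary[p] = (infer_datatype(values), infer_cardinality(instances, p))
    return summary


# Hypothetical instances of one candidate type.
PATIENTS = [
    {"ex:age": ["34"], "ex:enrolledAt": ["2026-01-05T09:00:00"],
     "ex:site": ["http://example.org/site/1"], "ex:allergy": ["latex", "pollen"]},
    {"ex:age": ["51"], "ex:enrolledAt": ["2026-02-11T14:30:00"],
     "ex:site": ["http://example.org/site/2"], "ex:allergy": []},
]

summary = summarise(PATIENTS)
for predicate, (datatype, cardinality) in summary.items():
    print(predicate, datatype, cardinality)
```

Requiring every value to agree before assigning a datatype keeps the sketch conservative: one malformed value demotes the field to a string instead of producing a wrong type.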

Output: a Python module containing @node_type-decorated classes with typed fields, plus optional @shape decorators when the data justifies constraints (e.g. observed min_value / max_value ranges, enumeration sets). The output includes confidence annotations as comments (e.g. # confidence: 0.93 — 14/15 instances have this field).

CLI surface:

# Infer from the default project store
trails onto infer

# Infer from a specific store / endpoint
trails onto infer --store /path/to/oxigraph.db
trails onto infer --endpoint http://localhost:7878/query

# Write output to a module instead of stdout
trails onto infer -o models/inferred.py

2. LLM-Assisted Generation — `trails onto generate`

Natural language in, @node_type code out.

Inputs: a domain description in plain text.

Process (non-deterministic, LLM cost reported upfront):

The user provides a domain description, e.g. "I'm building a clinical trial management system with patients, studies, sites, and adverse events."
The CLI estimates token cost based on description length and the configured model, reports it, and asks for confirmation.
trails.llm sends the description to the cheapest adequate model (Haiku-class for structural scaffolding; Sonnet-class when the description demands domain nuance — selectable via --model).
The LLM generates candidate @node_type definitions, relationship fields, and constraint annotations.
The CLI parses the LLM output, validates it against the Trails ORM surface, and writes a clean Python module.
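
The estimate-and-confirm step can be sketched as follows. This is an illustration under stated assumptions: the prices, the four-characters-per-token heuristic, the default output budget, and both function names are hypothetical and do not describe how `trails.llm` actually computes cost.

```python
import math

# Hypothetical USD prices per 1,000 tokens; real prices depend on the provider.
PRICE_PER_1K_TOKENS = {"haiku": 0.001, "sonnet": 0.01}
CHARS_PER_TOKEN = 4  # rough heuristic, not a tokenizer


def estimate_cost(description, model, expected_output_tokens=1000):
    """Estimate cost from description length plus an assumed output budget."""
    input_tokens = math.ceil(len(description) / CHARS_PER_TOKEN)
    total_tokens = input_tokens + expected_output_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]


def confirm(estimate, assume_yes=False, ask=input):
    """Return True only on explicit approval, or when assume_yes mirrors --yes."""
    if assume_yes:
        return True
    answer = ask(f"Estimated cost: ${estimate:.4f}. Proceed? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


DESCRIPTION = (
    "I'm building a clinical trial management system with patients, "
    "studies, sites, and adverse events."
)
estimate = estimate_cost(DESCRIPTION, "haiku")
print(f"Estimated cost: ${estimate:.4f}")
# A dry run would stop here; otherwise the call proceeds only if confirm() is True.
proceed = confirm(estimate, assume_yes=True)
```

Passing the prompt function in as `ask` keeps the gate testable and makes the default answer "no", which matches the rule that no call is made without confirmation.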

Output: a Python module with @node_type classes, typed fields, relationship references, and @shape decorators where the LLM suggested constraints.
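
One way to approach the parse-and-validate step is to inspect the generated source with Python's standard `ast` module before writing it to disk. The sketch below is an assumption about what such a check could look like, not the actual validation against the Trails ORM surface. The checks it applies and the `from trails import node_type` import path inside the sample strings are hypothetical.

```python
import ast

ALLOWED_DECORATORS = {"node_type", "shape"}


def decorator_name(node):
    """Return the bare name of a decorator written as @x, @x(...), or @pkg.x."""
    target = node.func if isinstance(node, ast.Call) else node
    if isinstance(target, ast.Name):
        return target.id
    if isinstance(target, ast.Attribute):
        return target.attr
    return None


def validate_generated_module(source):
    """Return a list of problems; an empty list means these checks passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            continue
        if isinstance(node, ast.ClassDef):
            names = {decorator_name(d) for d in node.decorator_list}
            if "node_type" not in names:
                problems.append(f"class {node.name} lacks @node_type")
            for name in sorted(n for n in names if n and n not in ALLOWED_DECORATORS):
                problems.append(f"class {node.name} uses unknown decorator @{name}")
        else:
            # Generated modules are expected to hold only imports and classes.
            problems.append(f"unexpected top-level statement on line {node.lineno}")
    return problems


# The import path below is illustrative; the source is parsed, never executed.
GOOD = """\
from trails import node_type

@node_type
class Patient:
    name: str
"""

BAD = """\
import os
os.system("echo hi")

class Study:
    title: str
"""

print(validate_generated_module(GOOD))
print(validate_generated_module(BAD))
```

Parsing without executing means a malformed or unexpected LLM response is rejected before any of its code can run.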

CLI surface:

# Interactive: prompt for domain description
trails onto generate

# One-shot from a file
trails onto generate --from domain.txt

# Specify model tier
trails onto generate --model haiku   # cheap, structural
trails onto generate --model sonnet  # domain-aware

# Dry run: show estimated cost without calling the LLM
trails onto generate --from domain.txt --dry-run

# Iterative refinement: feed back a previous output
trails onto generate --from domain.txt --refine models/v1.py

Cost envelope: every trails onto generate call opens a CostTracker envelope tagged onto:generate per ADR-0012. The estimated and actual token costs are printed at the end. No call is made without user confirmation unless --yes is passed.
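
The envelope idea can be sketched with a context manager that always reports estimated and actual cost on exit. `Envelope` and `cost_envelope` are stand-ins written for this example. They are not the `CostTracker` API defined by ADR-0012, whose real interface is not shown in this document.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Illustrative stand-in for a tagged cost envelope."""
    tag: str
    estimated_usd: float
    actual_usd: float = 0.0
    calls: list = field(default_factory=list)

    def record(self, label, usd):
        """Add one charged call to the envelope."""
        self.calls.append((label, usd))
        self.actual_usd += usd


@contextmanager
def cost_envelope(tag, estimated_usd, report=print):
    """Open an envelope and report estimate vs actual on exit, even after errors."""
    envelope = Envelope(tag, estimated_usd)
    try:
        yield envelope
    finally:
        report(
            f"[{envelope.tag}] estimated ${envelope.estimated_usd:.4f}, "
            f"actual ${envelope.actual_usd:.4f}"
        )


lines = []
with cost_envelope("onto:generate", 0.0011, report=lines.append) as env:
    env.record("completion", 0.0009)  # hypothetical charge for one LLM call
print(lines[0])
```

Reporting from the `finally` clause means the totals are printed even when generation fails partway through, so spent tokens are never silently lost.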

3. Usage-Driven Refinement — `trails onto refine`

Observes runtime patterns and suggests schema improvements.

Inputs: query logs, SHACL validation logs, and write-path statistics collected by the framework's observability hooks.

Signals and suggestions:

| Observation | Suggestion |
|---|---|
| Field X is null in >90% of instances | "Consider making `X` optional (`min_count=0`)" |
| Field Y violates its SHACL constraint in >10% of writes | "Constraint may be too strict; review `Y`'s `@shape` bounds" |
| Node types A, B, C always appear together in queries | "Consider a composite type or a relationship grouping" |
| Field Z is queried in >80% of read paths but has no type declaration | "Consider promoting `Z` to a typed field for validation" |
| New predicates appear that match no declared type | "Undeclared predicate `foo:bar` seen on 47 instances; candidate for typing" |
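
The first two rows can be expressed as simple threshold rules over per-field counters. The sketch below assumes the observability hooks expose counts shaped like `FieldStats`; that class, its field names, and the `suggest` function are hypothetical, while the 90% and 10% thresholds come from the table.

```python
from dataclasses import dataclass


@dataclass
class FieldStats:
    """Hypothetical per-field counters gathered from runtime logs."""
    name: str
    instances: int
    null_count: int
    writes: int
    violations: int


def suggest(stats, null_threshold=0.90, violation_threshold=0.10):
    """Turn raw counters into human-readable suggestions; nothing is applied."""
    suggestions = []
    for s in stats:
        if s.instances and s.null_count / s.instances > null_threshold:
            suggestions.append(f"Consider making `{s.name}` optional (`min_count=0`)")
        if s.writes and s.violations / s.writes > violation_threshold:
            suggestions.append(
                f"Constraint may be too strict; review `{s.name}`'s `@shape` bounds"
            )
    return suggestions


STATS = [
    FieldStats("middle_name", instances=200, null_count=190, writes=50, violations=0),
    FieldStats("age", instances=200, null_count=0, writes=100, violations=15),
    FieldStats("email", instances=200, null_count=10, writes=100, violations=5),
]

suggestions = suggest(STATS)
for line in suggestions:
    print(line)
```

The function only returns text. Keeping analysis separate from application is what allows the suggestions to accumulate in the background and be surfaced on demand.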

Output: a diagnostic report surfaced via trails doctor, with actionable suggestions and the code diff that would implement each one.

CLI surface:

# Analyse usage patterns and print suggestions
trails onto refine

# Show suggestions in trails doctor output
trails doctor   # includes onto-refine section when data is available

# Apply a specific suggestion (interactive confirmation)
trails onto refine --apply suggestion-id

Progressive, not blocking: refinement runs as a background analysis. It never blocks writes or queries. Suggestions accumulate and are surfaced on demand.

Progressive enhancement integration

The three mechanisms map cleanly onto the progressive enhancement levels from ADR-0021:

| Level | Mechanism | How it works |
|---|---|---|
| 0 | Manual `@node_type` | Status quo: hand-written type declarations |
| 1 | `trails onto infer` | Derive types from existing data |
| 2 | `trails onto generate` | Bootstrap types from natural language |
| 3 | `trails onto refine` | Evolve types from usage patterns |
| 4 | Continuous evolution | The framework suggests schema changes as data evolves (combines 1–3) |

Each level is additive. A project may use any combination. Generated code is indistinguishable from hand-written code — the output IS the @node_type / @shape surface, not an abstraction on top of it.

Design principles

Never auto-apply. All three mechanisms suggest; the user confirms. No schema change is applied without explicit approval. trails onto refine --apply requires interactive confirmation (or --yes for scripted pipelines).
Output is plain code. Generated @node_type and @shape definitions are regular Python modules. No proprietary format, no hidden state, no runtime dependency on the generator. Delete the generator tomorrow; the output still works.
Deterministic vs non-deterministic — clearly labelled. infer is deterministic (statistical analysis of the store). generate is non-deterministic (LLM completion). The CLI labels each output: # Source: trails onto infer (deterministic) vs # Source: trails onto generate (LLM: claude-3-haiku, 2026-04-17).
Cost-aware. infer and refine incur no LLM cost: infer runs SPARQL analysis over the store, and refine reads the logs and statistics the framework already collects. generate costs LLM tokens; the cost is estimated and reported before any API call, per ADR-0012. --dry-run shows the estimate without spending.
Reversible. If the ontology changes after adoption, trails onto evolve (the existing OntologyEvolution machinery) handles the migration. The generated code participates in the same diff/migrate workflow as hand-written code.
Composable. infer → generate → refine is a natural pipeline but not a required one. Each tool works standalone. A project may generate from a description, never infer. Another may infer from a legacy triple dump, never generate. A third may hand-write everything and only use refine for hygiene.

Non-goals

Automatic OWL reasoning from inferred schemas. Promoting inferred types to OWL classes with inference rules is M6 reasoner territory (ADR-0004). This ADR stops at @node_type + @shape.
Unsupervised schema changes in production. The framework suggests; a human (or an explicitly approved agent) applies. No silent migrations.
Replacing domain experts. The goal is augmentation, not replacement. A generated ontology is a starting point — faster than a blank page, not a substitute for domain review.
Cross-store federation. infer operates on a single store. Federated inference across multiple SPARQL endpoints is out of scope.

Consequences

Positive

Eliminates the blank-page problem. New projects get a draft ontology in minutes instead of weeks, ready for domain review.
Lowers the barrier for non-semweb users. "Describe your domain" is more accessible than "write SHACL shapes."
Reinforces ADR-0021's progressive enhancement: the framework actively helps users move from labels to types to shapes.
Creates a tight feedback loop: write data → infer types → validate with types → refine from validation → evolve. The ontology is a living document, not a static contract.

Negative

Inferred schemas may be wrong. Garbage data yields garbage types. The confidence annotations and mandatory human review mitigate this, but users may over-trust generated output.
LLM-generated code introduces non-determinism into the development workflow. Two runs of trails onto generate with the same input may produce different outputs. Pinning model + temperature + seed mitigates but does not eliminate this.
Additional CLI surface area to document and maintain.

Neutral

Storage model unchanged. Generated @node_type / @shape code uses the same Oxigraph store, the same IRI minting, the same validation pipeline as hand-written code.
No new runtime dependencies. infer runs SPARQL against the existing store, refine reads the logs and statistics gathered by the framework's observability hooks, and generate uses trails.llm, which is already a framework module.

Relationship to other ADRs

| ADR | Relationship |
|---|---|
| ADR-0012 (Cost as framework primitive) | `generate` opens a cost envelope per call; `--dry-run` reports estimated cost. |
| ADR-0017 (ActiveGraph ORM) | `@node_type` is the output format for all three mechanisms. |
| ADR-0021 (Progressive enhancement) | Foundation. Auto-ontology is the next additive layer: the framework helping users progress through the enhancement levels. |
| ADR-0022 (Cedar unified matcher) | Unchanged. Generated types participate in strongest-available-type matching like any hand-written type. |
| ADR-0004 (Query-time reasoning) | Out of scope. OWL promotion from inferred types is deferred to M6. |
| ADR-0009 (Provenance always on) | `generate` emits a `prov:Activity` of type `trails:OntologyGeneration` when a context is available. |

Alternatives considered

Schema-first only (status quo). Rejected. The blank-page problem is the single biggest onboarding friction point. Telling users "just write your ontology" is the semweb equivalent of "just write your SQL migrations" — technically correct, practically hostile.
Fully automatic ontology management. Rejected. Removing the human from the loop invites silent schema drift and breaks the trust model. The framework suggests; the human decides.
External ontology-learning tool (e.g. Text2Onto, OntoLearn). Rejected. These tools produce OWL/RDFS output that then needs to be translated back into Trails' @node_type / @shape surface. Keeping inference inside the framework means the output IS the framework's native surface — no translation layer, no impedance mismatch.
LLM-only generation (no statistical inference). Rejected. LLM generation is non-deterministic and costs money. Statistical inference over an existing store is deterministic and free. Both have value; neither subsumes the other.

Open questions

Should trails onto infer support incremental mode (analyse only new triples since last run), or always re-analyse the full store? Recommendation: full-store for v1; incremental as a performance optimisation when stores exceed ~100k triples.
Should trails onto generate support multi-turn refinement (chat-style back-and-forth with the LLM), or only single-shot generation with a --refine flag? Recommendation: --refine for v1; multi-turn adds session state complexity that is not justified until user demand is proven.
What confidence threshold should infer use before including a candidate type or field in the output? Recommendation: 0.7 default, configurable via --min-confidence. Candidates below the threshold are still emitted, but commented out and annotated with their confidence score.
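
If the recommended threshold behaviour is adopted, emitting a field could look roughly like the sketch below. `render_field` and the exact comment layout are assumptions made for illustration. Confidence is computed as the share of instances carrying the field, as in the 14/15 example given for the infer output.

```python
def render_field(name, type_name, present, total, min_confidence=0.7):
    """Render one field line; candidates below the threshold are commented out."""
    confidence = present / total
    declaration = (
        f"{name}: {type_name}  "
        f"# confidence: {confidence:.2f} ({present}/{total} instances have this field)"
    )
    if confidence < min_confidence:
        return "    # " + declaration
    return "    " + declaration


kept = render_field("email", "str", present=14, total=15)
dropped = render_field("fax", "str", present=6, total=15)
print(kept)
print(dropped)
```

Leaving low-confidence fields in the file as comments keeps them visible for review while leaving the generated module valid Python.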
