ADR-0019: App surface — ingestion, vector retrieval, admin UI

  • Status: Accepted (2026-04-19)
  • Date: 2026-04-14
  • Targets: M10 (propose v3.5.0)
  • Supersedes:
  • Superseded by:

Context

Trails positions itself as "Rails for agentic-semantic-web apps." The Rails analogy has concrete load-bearing pieces today — project layout, a Python-first shape surface (ADR-0002), rich capability manifests (ADR-0005), a default triple store (ADR-0007), always-on provenance (ADR-0009), cost as a primitive (ADR-0012), an ActiveGraph ORM (ADR-0017). The app layer is missing.

Concretely, the reference consumer — the reference compliance application (evidence graphs from documents; compliance audit trails) — needs three things on day one that Trails does not provide:

  1. A way to get documents into the KG. There is no trails.ingest. A compliance-shaped app today must hand-assemble pypdf / trafilatura / chunking / PROV-O emission for every corpus, and every team reinvents the pipeline differently.
  2. A way to retrieve by semantic similarity, combined with graph structure. Trails has SPARQL (via GraphStore.query) but no embedding layer, no vector index, no hybrid retrieval surface. "Find evidence supporting claim X, ranked by both citation graph position and textual similarity" is the defining query shape for the reference application and is currently impossible without a side-car stack.
  3. A usable operator / admin UI. The only UI surface in Trails today is prov_explorer.py — a narrow Python API over the prov: graph. There is no graph browser, no capability invocation form, no run inspector, no ingestion dashboard, no cost view. Every app builder reinvents a React app against http_adapter.py.

Rails shipped with scaffold, default views, and the Rails console out of the box. Without the three primitives above, Trails' Rails positioning is aspirational, not concrete.

Two framings were considered:

  1. Depend on LlamaIndex (or similar) for ingestion + retrieval; no admin UI in core. Pragmatic, ships fast, but pulls in opinions about PROV-O that conflict with ADR-0009, and outsources the thing that differentiates Trails (the graph-native surface) to a vector-first framework that treats the KG as a second-class store. Admin-UI gap remains.
  2. Build three thin, composable primitives inside Trails, each an adapter over existing best-of-breed libraries. Keeps provenance, cost, and shapes first-class; keeps the dep surface honest (pypdf, sqlite-vec, React — not an SDK ecosystem); lets the admin UI be auto-generated from what Trails already knows (shapes, capability manifest, provenance graph).

Decision

Adopt framing 2. Trails adds three composable app-builder primitives as a single coordinated surface, landing in M10.

1. trails.ingest — document ingestion pipeline

  • Module: python/src/trails/ingest/ (package: __init__.py, extractors.py, chunkers.py, loaders.py, pipeline.py).
  • Key surface:
      • trails.ingest.Pipeline(extractors=..., chunker=..., loader=...)
      • pipeline.run(source: Path | str | Iterable[Path]) -> IngestReport
      • IngestReport(documents: list[DocumentRef], errors: list[...], resume_token: str)
  • Extractors: PdfExtractor (pypdf + pdfplumber fallback), HtmlExtractor (trafilatura), MarkdownExtractor, PlainTextExtractor, DocxExtractor (python-docx).
  • Chunkers: ParagraphChunker, TokenChunker(max_tokens=...), SectionChunker (heading-aware).
  • Loader: KGLoader(store, graph=...) writes schema:TextDigitalDocument + child chunks linked via schema:hasPart, returning IRIs minted per ADR-0003.
  • Semantics: each ingestion run is a single prov:Activity (ADR-0009), with the source file(s) as prov:used entities and each produced document/chunk as prov:generated. Failures are retryable from the returned resume_token — partial runs never corrupt the KG because the loader writes per-document in a transaction and records a checkpoint before moving to the next source.
  • Dependencies: pypdf, pdfplumber, trafilatura, python-docx, markdown-it-py, tiktoken (token counting). All optional via extras; missing extras surface as clear MissingExtractor errors, never silent skips.
  • ADR composition:
      • ADR-0002 (shapes): Document and Chunk are published as canonical shapes in the trails.ingest namespace; users subclass with @shape(extends=["trails:Document"]) for domain-specific metadata.
      • ADR-0009 (provenance): always-on; ingestion cannot turn it off.
      • ADR-0012 (cost): ingestion reports per-document bytes processed and wall-clock; no LLM cost unless an embedder is attached in phase 2.
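
The resume semantics described above can be illustrated with a minimal sketch. It is not the trails.ingest implementation: run_with_checkpoints, the JSON-encoded token format, and the in-memory store dict are hypothetical stand-ins, chosen only to show a per-document write followed by a checkpoint.

```python
from __future__ import annotations

import json
from typing import Callable, Iterable


def run_with_checkpoints(
    sources: Iterable[str],
    load_one: Callable[[str], None],
    resume_token: str | None = None,
) -> tuple[list[str], list[tuple[str, str]], str]:
    """Process sources one at a time; return (done, errors, resume_token)."""
    # Hypothetical token format: a JSON list of sources already loaded.
    done: list[str] = json.loads(resume_token) if resume_token else []
    errors: list[tuple[str, str]] = []
    for src in sources:
        if src in done:
            continue  # loaded by an earlier run, nothing to redo
        try:
            load_one(src)  # stands in for one transactional per-document write
        except Exception as exc:
            errors.append((src, str(exc)))
            continue  # no checkpoint recorded, so a retry picks it up again
        done.append(src)  # checkpoint only after the write succeeded
    return done, errors, json.dumps(done)


# Usage: "b.pdf" fails on the first run and succeeds on the retry.
store: dict[str, str] = {}
fail_once = {"b.pdf"}


def load_one(src: str) -> None:
    if src in fail_once:
        fail_once.discard(src)
        raise ValueError("unreadable")
    store[src] = "loaded"


corpus = ["a.pdf", "b.pdf", "c.pdf"]
done, errors, token = run_with_checkpoints(corpus, load_one)
done2, errors2, token2 = run_with_checkpoints(corpus, load_one, token)
```

The second call skips the two sources recorded in the token and retries only the one that failed, which is the property the Semantics bullet relies on.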

2. trails.vector — embeddings + vector retrieval

  • Module: python/src/trails/vector/ (__init__.py, embedders.py, stores.py, retrieve.py).
  • Key surface:
      • trails.vector.Embedder: abstract; built-in SentenceTransformersEmbedder (local), OpenAIEmbedder (API), NullEmbedder (tests).
      • trails.vector.VectorStore: abstract; adapters: SqliteVecStore (zero-ops default), QdrantStore (scale), OxigraphLiteralStore (tiny apps; embeddings as typed literals on chunk IRIs).
      • trails.retrieve(query: str, *, mode: Literal["graph", "vector", "hybrid"] = "hybrid", k: int = 10, sparql_filter: str | None = None) -> list[RetrievalHit] is the single entry point. In hybrid mode it runs the SPARQL filter first, scores the remaining candidate chunks by embedding similarity to the query, and fuses the rankings by reciprocal rank fusion (RRF).
  • Semantics: vector retrieval is a capability (ADR-0005) with declared cost (token_estimate per query, USD per embedding batch) and declared side effects (reads: [ <app:text> ], writes: []). Embedding generation is itself a capability, producing PROV-O derivation edges from source chunk → embedding.
  • Dependencies: sqlite-vec (default), sentence-transformers (default local embedder), openai (optional, remote embedder), qdrant-client (optional, scale adapter). Oxigraph literal storage uses only core deps.
  • ADR composition:
      • ADR-0007 (Oxigraph default): OxigraphLiteralStore lets the zero-ops path stay zero-ops for tiny corpora (<100k chunks) without adding a second store.
      • ADR-0005 (rich manifest): retrieval capability declares input_shape = trails:RetrievalQuery, enabling agent-readable discovery.
      • ADR-0012 (cost): embedding cost is first-class, reported in the response envelope.
  • Not a new vector DB abstraction. Adapters are thin — we do not reimplement HNSW; we call the adapter's native API.
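
The fusion step of hybrid retrieval can be sketched with standard reciprocal rank fusion, where each item scores the sum of 1 / (rrf_k + rank) over the rankings it appears in. The ADR does not fix the constant or the tie-breaking rule; rrf_k = 60 is a commonly used value, and the function name and chunk IRIs below are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Fuse ranked lists of IRIs into one list of (iri, score), best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; items missing from a ranking contribute nothing.
        for rank, iri in enumerate(ranking, start=1):
            scores[iri] += 1.0 / (rrf_k + rank)
    # Sort by descending score; break ties by IRI for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Usage: one ranking from graph position, one from vector similarity.
graph_rank = ["app:c2", "app:c1", "app:c3"]
vector_rank = ["app:c1", "app:c4", "app:c2"]
fused = reciprocal_rank_fusion([graph_rank, vector_rank])
fused_iris = [iri for iri, _ in fused]
```

Because RRF uses only ranks, it needs no score normalization between the SPARQL-derived ordering and the cosine scores, which is one reason it is a common default for hybrid retrieval.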

3. trails-admin — auto-generated admin UI (separate optional package)

  • Package: python/trails-admin/ (distinct distribution: pip install trails[admin] or pip install trails-admin).
  • Entry point: trails admin serve --app myapp:app --port 4455 — mounts a FastAPI sub-app on the existing http_adapter and serves a React SPA from trails_admin/static/.
  • Views (auto-generated from registry + shapes):
      • Graph browser. Faceted node/edge view; facets are derived from declared shapes (@shape registrations) and named graphs; SPARQL is never exposed in the default UI.
      • Capability invocation forms. JSON-Schema-driven, built from the input shape of each registered capability (ADR-0005 projection). Rails-scaffold semantics: list → show → new → edit → submit.
      • Capability run inspector. Per-run view showing PROV-O activity (ADR-0009), cost envelope (ADR-0012), structured log lines, and a replay button that re-invokes with identical inputs under a new run ID.
      • Ingestion dashboard. Live view of in-flight Pipeline runs, per-source progress, error list, resume-token capture.
      • Cost / budget / quota views. Fixed set: per-capability spend, per-principal spend, budget exhaustion alerts. Read-only in v1; budget mutation stays in the Python surface.
  • Stack: FastAPI (already present via http_adapter.py) hosts the REST surface; React + Vite + shadcn/ui for the SPA. The built bundle ships in the wheel — no Node required at install time. A Node toolchain is required only for contributors building the UI.
  • ADR composition:
      • ADR-0005: admin UI is a fourth projection of the canonical capability manifest (alongside MCP, OpenAPI, JSON-LD); the invocation form is generated from the same input_shape that MCP consumes.
      • ADR-0006 (Cedar): every admin API call is authorized by the existing policy engine; admin UI has no privileged path.
      • ADR-0011 (DID identity) / ADR-0010 (Biscuit): operator login reuses the DID + biscuit stack where available; in its absence (dev mode) a session cookie bound to a configured admin DID is the fallback. See open questions.
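
To make the JSON-Schema-driven forms concrete, here is a minimal sketch of how form fields could be derived from a capability's input schema. The schema shown is a hypothetical projection of trails:RetrievalQuery, and form_fields and the type-to-widget mapping are illustrative assumptions, not the trails-admin implementation.

```python
# Hypothetical mapping from JSON Schema types to form widgets.
WIDGETS = {
    "string": "text",
    "integer": "number",
    "number": "number",
    "boolean": "checkbox",
}


def form_fields(schema: dict) -> list:
    """Turn an object schema into a flat list of form field descriptors."""
    required = set(schema.get("required", []))
    fields = []
    for name, spec in schema.get("properties", {}).items():
        # An enum becomes a select; unknown types fall back to a textarea.
        if "enum" in spec:
            widget = "select"
        else:
            widget = WIDGETS.get(spec.get("type"), "textarea")
        fields.append({
            "name": name,
            "widget": widget,
            "required": name in required,
            "options": spec.get("enum"),
            "default": spec.get("default"),
        })
    return fields


# Assumed JSON Schema for the retrieval capability's input shape.
retrieval_query_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "mode": {
            "type": "string",
            "enum": ["graph", "vector", "hybrid"],
            "default": "hybrid",
        },
        "k": {"type": "integer", "default": 10},
    },
    "required": ["query"],
}
fields = form_fields(retrieval_query_schema)
```

Deriving the form from the same schema that other projections consume means a change to the input shape shows up in the UI without a separate form definition.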

Phased delivery (M10 scope)

Phase | Content | Ships
1 | trails.ingest: PDF + HTML + markdown extractors, ParagraphChunker, KGLoader with PROV-O, resume tokens | v3.3.0
2 | trails.vector: SqliteVecStore, SentenceTransformersEmbedder, trails.retrieve(mode="hybrid") | v3.4.0
3 | trails-admin MVP: capability forms, run inspector, ingestion dashboard | v3.5.0-rc
4 | trails-admin: graph browser + cost/budget views | v3.5.0

Phases 1 and 2 are independently useful. Phase 3 is blocked on phases 1 and 2 shipping, because the dashboards' reference data come from those subsystems.

Concrete surface sketch

Ingest a directory of PDFs into the KG.

from pathlib import Path
from trails import Store
from trails.ingest import Pipeline, PdfExtractor, ParagraphChunker, KGLoader

store = Store.open("app.db")
pipe = Pipeline(
    extractors=[PdfExtractor()],
    chunker=ParagraphChunker(max_tokens=512),
    loader=KGLoader(store, graph="app:evidence"),
)
report = pipe.run(Path("./corpus/"))
print(f"{len(report.documents)} docs, {report.total_chunks} chunks, "
      f"resume={report.resume_token}")
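
The ADR does not define how ParagraphChunker splits text. As a rough illustration of what a paragraph chunker with a token budget might do, the sketch below splits on blank lines and packs paragraphs greedily. The function name is hypothetical, and whitespace word counts stand in for a real tokenizer such as tiktoken.

```python
import re


def chunk_paragraphs(text: str, max_tokens: int = 512) -> list:
    """Split on blank lines, then pack paragraphs into chunks under a budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, used = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # crude token estimate: whitespace-separated words
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        # A single oversized paragraph still becomes its own chunk, unsplit.
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Usage with a tiny budget so the split is visible.
sample = "one two three\n\nfour five\n\nsix seven eight nine"
chunks = chunk_paragraphs(sample, max_tokens=5)
```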

Hybrid retrieval: SPARQL filter + vector similarity.

from trails import retrieve

hits = retrieve(
    "evidence that the defendant was present at the scene",
    mode="hybrid",
    k=20,
    sparql_filter="""
        ?chunk schema:isPartOf ?doc .
        ?doc schema:dateCreated ?d .
        FILTER (?d >= "2024-01-01"^^xsd:date)
    """,
)
for h in hits[:3]:
    print(h.iri, h.score, h.snippet[:80])

Launch the admin UI against a running Trails app.

pip install 'trails[admin]'
trails admin serve --app myapp:app --port 4455
# open http://127.0.0.1:4455 — graph browser, capability forms, run
# inspector, ingestion dashboard, cost views, all from the registry.

Consequences

Positive

  • Compliance-shaped apps shippable in days, not weeks. Ingestion + retrieval + operator UI are the three week-long side-quests on every such project today.
  • Admin UI is executable documentation. New users see every registered capability as a working form the moment they install the framework — closes the gap between docs and reality.
  • Rails analogy becomes concrete. Scaffolds, console-adjacent inspector, and convention-over-configuration now have real referents.
  • Provenance stops being theoretical. The run inspector turns ADR-0009 into a daily-use surface; operators see the chain.
  • Dogfooding for ADR-0005 and ADR-0017. Admin UI exercises the manifest projections and the ActiveGraph surface end to end; any weakness surfaces immediately.

Negative

  • Significant new dependency surface. pypdf, pdfplumber, trafilatura, python-docx, sqlite-vec, sentence-transformers, qdrant-client, plus React / Vite / shadcn in the admin package. All optional via extras, but the matrix of supported combinations grows.
  • JS build pipeline enters the repo. The admin package requires Node for contributors. Release wheels ship prebuilt bundles; a reproducible-builds hook (ADR-0014) must extend to the JS toolchain.
  • Security surface expands. Admin UI is an authenticated, privileged surface. Misconfigured, it is a data exfiltration path for the entire KG. Mitigations: mandatory Cedar policy gate + DID-bound login + deny-by-default for non-admin principals.
  • Vector retrieval is a correctness hazard. Hybrid scoring is opinionated (RRF); users will want to tune fusion, and keeping that tuning surface small and disciplined is a new maintenance load.
  • Admin UI versioning. Decoupling trails-admin from trails core is ergonomic but creates a compatibility matrix. See open questions.

Neutral

  • Optional install keeps the core small. pip install trails is unchanged in size for users who do not want ingestion, vectors, or the admin UI.
  • Existing apps are unaffected. No breaking change to the kernel, the shape surface, the manifest, or the HTTP adapter. Pre-M10 apps keep working and can adopt primitives piecewise.

Non-consequences

  • The kernel (Rust) is unchanged. All three primitives are Python-only, riding on existing GraphStore and capability plumbing.
  • The MCP transport (ADR-0008) is unchanged. Admin UI speaks HTTP, not MCP.
  • Oxigraph remains the default triple store (ADR-0007); vector adapters are orthogonal.

Scope fence (explicit non-goals)

  • Not a full DMS / CMS. Ingestion is file-level — no workflow, approval chains, check-in / check-out, or document lifecycle management. Users wanting a CMS integrate one and point trails.ingest at its outputs.
  • Not a new vector database. trails.vector is an adapter layer; we will never ship our own ANN index, persistence layer, or sharding protocol. If sqlite-vec, Qdrant, and Oxigraph literals are all inadequate, users write an adapter.
  • Not a theming system. The admin UI in v1 uses shadcn defaults and is functional, not brandable. No theming API, no slot overrides, no white-label path.
  • Not a dashboard builder. Cost / budget / quota views are fixed. Users wanting custom dashboards build them against the HTTP adapter; the admin UI is not a BI tool.
  • Not an LLM orchestration layer. trails.retrieve returns hits; it does not assemble prompts, call LLMs, or run RAG end-to-end. Users compose those on top.

Alternatives considered

  1. Depend on LlamaIndex for ingestion + retrieval. Rejected. LlamaIndex imports a large opinionated surface (nodes, readers, indexes) whose lifecycle does not cleanly emit PROV-O (ADR-0009) without monkeypatching, and whose cost model conflicts with ADR-0012. Importing it as a dependency would effectively invert the stack — Trails becomes a plugin to LlamaIndex rather than the other way around.
  2. No admin UI in core. Rejected. Every consumer reinvents one; the Rails analogy collapses without it; and the manifest (ADR-0005) is unobservable to operators who cannot write SPARQL.
  3. Admin UI as a purely third-party package. Possible long term, and the package boundary chosen here enables it. For v1 we ship a coordinated reference implementation because the shape surface, the manifest projections, and the provenance graph need to evolve together with the UI to catch misfits early. We will re-evaluate at v4.
  4. SQLite FTS5 instead of vectors. Rejected as the default. FTS5 handles lexical retrieval well but cannot do semantic similarity, which is the load-bearing case for the reference application. Users who want lexical-only can select NullEmbedder + SPARQL filter and never pay the vector cost.
  5. Put the vector store inside Oxigraph via custom literals. Shipped as one of three backends (OxigraphLiteralStore), but not the default. See open questions — this may become the default for tiny apps.

Open questions

  1. Admin UI authentication: DID / biscuit or session cookie? Does admin UI login piggyback on ADR-0011 DIDs + ADR-0010 biscuits (principled, but uneven DX for operators who have never seen a DID), or do we ship a plain session-cookie login bound to a configured admin DID (ergonomic, but a second auth path to maintain)? This is the hardest question in the ADR; it blocks phase 3. Leaning: DID-primary with a dev-mode cookie fallback, but the cookie path's threat model needs review before phase 3 starts.
  2. Vector store co-location with Oxigraph. For tiny apps, is OxigraphLiteralStore the default (zero additional dependency, one store to back up) or always a fallback? The query-time cost of cosine-over-literals at 100k chunks needs a benchmark before we decide.
  3. Admin UI versioning. trails-admin as a separate distribution raises the compatibility question: does the admin UI pin an exact trails version (safe, restrictive), a compatible range (>=3.5,<4), or support N and N-1 Python surfaces simultaneously? Affects the release cadence.
  4. PROV-O granularity for chunking. Is every chunk its own prov:Entity with prov:wasDerivedFrom to the parent document (expressive, storage-heavy), or is the pipeline a single activity with documents as entities and chunks as schema:hasPart children (cheap, less traceable)? Probably the latter by default with an opt-in to the former.
  5. Embedding provider secrets. Where do OpenAIEmbedder credentials live — trails.toml, env vars, the Cedar policy store, or a new trails secrets surface? Touches ADR-0014 (supply chain) and the as-yet-undesigned secrets story.
  6. Admin UI and Cedar authz for graph browser. The graph browser shows every triple a principal is entitled to see, which means every node / edge render is a Cedar decision. Do we cache decisions per session, or re-evaluate per render? Performance envelope is unclear without a prototype.
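
For open question 2, the cost under discussion is a linear scan: with embeddings stored as literals there is no ANN index, so every query compares against every chunk. The sketch below shows that brute-force shape using only the standard library. The function names and toy vectors are hypothetical, and a real benchmark would also need realistic embedding dimensions and the cost of decoding literals from the store.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_cosine(query, embeddings, k=10):
    """Score every stored vector against the query; O(n * d) per query."""
    scored = ((cosine(query, vec), iri) for iri, vec in embeddings.items())
    # nlargest keeps only k candidates in memory while scanning all n.
    return [(iri, score) for score, iri in heapq.nlargest(k, scored)]


# Usage with toy 2-dimensional embeddings keyed by chunk IRI.
embeddings = {
    "app:c1": [1.0, 0.0],
    "app:c2": [0.0, 1.0],
    "app:c3": [1.0, 1.0],
}
top = top_k_cosine([1.0, 0.1], embeddings, k=2)
top_iris = [iri for iri, _ in top]
```

Timing this loop at the target corpus size and embedding dimension would give a first-order answer on whether the literal store is viable as a default.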