Document ingestion

trails.ingest turns PDF, HTML, Markdown, DOCX, RTF, and plain-text documents into typed Document + Chunk nodes inside the kernel KG — with always-on PROV-O (ADR-0009) and hash-based dedup so re-ingesting a file is a no-op. It is the M10 Phase 1 slice of ADR-0019: the non-deferrable piece for any compliance-shaped app, and the upstream for trails.vector.

Quickstart

from trails.testing import fresh_context
from trails.ingest import ingest_file, ingest_directory

ctx = fresh_context()

# Single file — returns the minted Document IRI.
doc_iri = ingest_file("paper.pdf", ctx, source="arxiv:2404.12345")

# Directory walk (recursive by default).
report = ingest_directory("./corpus", ctx, glob="*.pdf")
print(f"Ingested {len(report)} documents, {len(report.failures)} failures")

# Re-ingesting the same file is idempotent — the content hash already exists.
same_iri = ingest_file("paper.pdf", ctx)
assert same_iri == doc_iri

Each ingest_file run extracts → chunks → loads → emits a prov:Activity. ingest_directory fans out per file, collects failures into an IngestReport, and emits one batch activity.

Extractors

Install the optional text backends with:

pip install 'trails[ingest]'

Each backend is lazy-imported. When a dep is missing the extractor raises MissingExtractorDep (a TrailsError subclass) with the exact pip install line — no silent fallbacks, no raw ImportError.
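The lazy-import behaviour described above can be sketched in a few lines. This is an illustration only: the TrailsError and MissingExtractorDep classes below are local stand-ins, and the lazy_import helper and the message format are assumptions, not the library's actual code.

```python
import importlib


class TrailsError(Exception):
    """Local stand-in for the framework's base error (assumed shape)."""


class MissingExtractorDep(TrailsError):
    """Raised when an optional extractor backend is not installed."""

    def __init__(self, module: str, extra: str = "ingest") -> None:
        self.module = module
        self.install_hint = f"pip install 'trails[{extra}]'"
        super().__init__(
            f"{module!r} is required for this extractor; run: {self.install_hint}"
        )


def lazy_import(module: str):
    """Import a backend on first use; never let a raw ImportError escape."""
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        # Translate into the typed error that carries the install line.
        raise MissingExtractorDep(module) from exc
```

Calling lazy_import inside the extractor body, rather than at module import time, is what keeps a bare install of the package importable when the optional backends are absent.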

| MIME | Extensions | Extractor | Dep |
| --- | --- | --- | --- |
| application/pdf | .pdf | extract_pdf | pypdf |
| text/html, application/xhtml+xml | .html, .htm, .xhtml | extract_html | trafilatura |
| text/markdown, text/x-rst | .md, .markdown, .rst | extract_markdown | markdown-it-py |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | .docx | extract_docx | python-docx |
| application/rtf | .rtf | extract_rtf | stdlib |
| text/plain | .txt, .text | extract_text | stdlib |

PDF concatenates per-page text with blank lines between pages so the chunker still sees paragraph breaks. Image-only pages are treated as empty and never raise.

HTML runs through trafilatura.extract, which strips navigation, ads, and chrome. When trafilatura returns None for a fragment, a naive tag-strip + whitespace-collapse fallback kicks in.

Markdown strips a leading YAML front-matter block, then returns the source verbatim — Markdown doubles as plain text for paragraph chunking. markdown-it-py is still called (MarkdownIt().parse(text)) purely as a validation gate so catastrophically malformed input fails fast rather than producing nonsense chunks.
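A minimal sketch of the front-matter step, assuming the block is delimited by lines of exactly three dashes at the very start of the file. The strip_front_matter name and the regex are illustrative, not the extractor's real implementation.

```python
import re

# A leading block fenced by '---' lines; anchored with \A so a thematic
# break later in the document is never mistaken for front matter.
_FRONT_MATTER = re.compile(
    r"\A---[ \t]*\r?\n.*?\r?\n---[ \t]*(?:\r?\n|\Z)",
    re.DOTALL,
)


def strip_front_matter(text: str) -> str:
    """Drop one leading YAML front-matter block and return the rest verbatim."""
    return _FRONT_MATTER.sub("", text, count=1)
```

Everything after the closing delimiter is returned untouched, which matches the "source verbatim" behaviour described above.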

Plain text (extract_text) probes the first 8 KB as strict UTF-8. On a clean probe it decodes the full payload strictly; if that still fails (non-UTF-8 bytes past the probe window), it logs a second warning and falls back to latin-1. Leading UTF-8 BOMs are stripped and \r\n / lone \r line endings are normalised to \n.
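The decode ladder can be sketched as follows. Two things here are assumptions rather than documented behaviour: that a failed probe also falls back to latin-1 with a warning, and the exact log messages. The decode_text name is hypothetical.

```python
import codecs
import logging

log = logging.getLogger("ingest_sketch")
PROBE_BYTES = 8 * 1024


def decode_text(payload: bytes) -> str:
    # Strip a leading UTF-8 BOM before decoding.
    if payload.startswith(codecs.BOM_UTF8):
        payload = payload[len(codecs.BOM_UTF8):]

    # Incremental decoding with final=False tolerates a multi-byte
    # character cut in half at the probe boundary.
    probe = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        probe.decode(payload[:PROBE_BYTES], final=False)
    except UnicodeDecodeError:
        # Assumed behaviour: a dirty probe goes straight to latin-1.
        log.warning("probe is not valid UTF-8; falling back to latin-1")
        text = payload.decode("latin-1")
    else:
        try:
            text = payload.decode("utf-8")
        except UnicodeDecodeError:
            log.warning("non-UTF-8 bytes past the probe window; falling back to latin-1")
            text = payload.decode("latin-1")

    # Normalise CRLF first so a lone-CR pass does not double the newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

latin-1 maps every byte to a code point, so the fallback can never raise; the cost is that mis-encoded input decodes to the wrong characters instead of failing.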

DOCX (landed in commit 44f6584) uses python-docx. Body paragraphs are kept with internal whitespace collapsed. Tables are rendered row-wise with " | " between cells and "\n" between rows; each table becomes one block. Paragraphs whose style name starts with Header or Footer (case-insensitive, covers localised variants) are dropped. Images, charts, and OLE objects are silently skipped — OCR is an ADR-0019 non-goal. Paragraphs and tables are joined with \n\n so the chunker sees the boundaries.
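The text-shaping rules above are easy to show without python-docx. The helpers below operate on plain strings and lists; their names are hypothetical, and reading paragraphs, styles, and cells out of a real .docx is left out of the sketch.

```python
import re


def collapse_ws(text: str) -> str:
    """Collapse internal whitespace runs in a body paragraph."""
    return re.sub(r"\s+", " ", text).strip()


def is_header_or_footer(style_name: str) -> bool:
    """Case-insensitive prefix match on the paragraph style name."""
    return style_name.casefold().startswith(("header", "footer"))


def render_table(rows: list[list[str]]) -> str:
    """One block per table: ' | ' between cells, newline between rows."""
    return "\n".join(" | ".join(row) for row in rows)


def join_blocks(blocks: list[str]) -> str:
    """Blank line between blocks so a paragraph chunker sees the boundaries."""
    return "\n\n".join(block for block in blocks if block)
```

For example, join_blocks([collapse_ws(p), render_table(rows)]) yields a paragraph followed by a table block, separated by a blank line.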

RTF is a stdlib-only stripper for the "someone emailed me an RTF with plain paragraphs" case. It walks {\dest ...} group nesting to drop hidden destinations (fonttbl, stylesheet, info, pict, object, and siblings), translates \par / \line / \tab, decodes \uNNNN as signed 16-bit, strips remaining control words, and collapses braces. For rich RTF install a dedicated parser and pass the extracted text through ingest_file(..., mime="text/plain").
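The signed 16-bit detail is the part most often gotten wrong, so here is a small sketch of it. RTF writes \uN as a signed 16-bit decimal, so code points above 32767 appear as negative numbers. The function names are hypothetical, and the regex assumes a single "?" fallback character after each escape, which is a simplification of the real \ucN rules.

```python
import re

# \uN followed by an optional delimiter space and one '?' fallback character.
_UNICODE_ESCAPE = re.compile(r"\\u(-?\d+) ?\??")


def decode_signed_u16(value: int) -> str:
    """Map a signed 16-bit RTF value back to its UTF-16 code unit."""
    return chr(value & 0xFFFF)


def replace_unicode_escapes(rtf: str) -> str:
    """Replace every \\uN escape with the character it encodes."""
    return _UNICODE_ESCAPE.sub(lambda m: decode_signed_u16(int(m.group(1))), rtf)
```

Control words such as \uc1 are left alone because the pattern requires a digit or minus sign directly after \u.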

Chunker

from trails.ingest import paragraph_chunker

chunks = paragraph_chunker(text, min_chars=100, max_chars=1500)

The default chunker is deterministic and NLP-free. Algorithm:

  1. Split on runs of two+ newlines (paragraph break).
  2. Merge paragraphs shorter than min_chars into the next paragraph (stray headings join their body); a trailing short paragraph glues to the previous chunk.
  3. Cap paragraphs longer than max_chars by splitting at sentence boundaries ((?<=[.!?])\s+), hard-slicing when a single sentence still overshoots.
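The three steps above can be sketched as a standalone function. This is not the library's implementation: the RawChunk dataclass is a local stand-in, greedy packing of sentences up to max_chars is an assumption about how the cap is applied, and merged chunks keep the original paragraph breaks so that text[start_char:end_char] always equals the chunk text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RawChunk:
    index: int
    text: str
    start_char: int
    end_char: int


_PARA_BREAK = re.compile(r"\n{2,}")
_SENTENCE_GAP = re.compile(r"(?<=[.!?])\s+")


def _spans(text, pattern, start, end):
    """Yield non-empty (start, end) spans of text[start:end] between matches."""
    pos = start
    for m in pattern.finditer(text, start, end):
        if m.start() > pos:
            yield pos, m.start()
        pos = m.end()
    if end > pos:
        yield pos, end


def paragraph_chunker(text, min_chars=100, max_chars=1500):
    # Step 1: split on runs of two or more newlines.
    # Step 2: carry short paragraphs forward into the next one.
    merged, carry_start, carry_end = [], None, None
    for start, end in _spans(text, _PARA_BREAK, 0, len(text)):
        if carry_start is not None:
            start, carry_start = carry_start, None
        if end - start < min_chars:
            carry_start, carry_end = start, end
        else:
            merged.append((start, end))
    if carry_start is not None:
        # A trailing short paragraph glues to the previous chunk, if any.
        if merged:
            merged[-1] = (merged[-1][0], carry_end)
        else:
            merged.append((carry_start, carry_end))

    # Step 3: cap long paragraphs at sentence boundaries.
    spans = []
    for start, end in merged:
        if end - start <= max_chars:
            spans.append((start, end))
            continue
        cur_start = cur_end = None
        for s, e in _spans(text, _SENTENCE_GAP, start, end):
            if cur_start is not None and e - cur_start > max_chars:
                spans.append((cur_start, cur_end))
                cur_start = None
            if cur_start is None:
                # Hard-slice a single sentence that still overshoots.
                while e - s > max_chars:
                    spans.append((s, s + max_chars))
                    s += max_chars
                cur_start = s
            cur_end = e
        if cur_start is not None:
            spans.append((cur_start, cur_end))

    return [RawChunk(i, text[s:e], s, e) for i, (s, e) in enumerate(spans)]
```

Working on offsets rather than on split strings is what makes the result deterministic and keeps every chunk addressable in the source text.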

Each Chunk carries start_char / end_char offsets into the source text, for uses such as highlight-on-hover and per-chunk provenance. Plug in a different chunker by passing chunker=my_fn to ingest_file / ingest_directory. The protocol is a bare Callable[[str], list[Chunk]], with no base class to inherit from.

Loader + node types

Document and Chunk are ordinary @node_type classes (see the ORM guide), so every capability that reads / writes them goes through the same ORM surface as user code.

from trails.ingest import Document, Chunk

doc = Document.find(ctx, doc_iri)
chunks = Chunk.where(document=doc).order_by("index").fetch(ctx)

| Field (Document) | Type | Notes |
| --- | --- | --- |
| uri | str | Source URI (file://, arxiv:, https://, …) |
| title | str \| None | Not auto-extracted in Phase 1 |
| mime | str | Detected MIME |
| content_hash | str | 16-char base64 SHA-256 of extracted text |
| byte_size | int | UTF-8 byte length of extracted text |
| ingested_at | datetime | UTC timestamp |

| Field (Chunk) | Type | Notes |
| --- | --- | --- |
| document | Document (ref) | Back-reference, stored as IRI triple |
| index | int | 0-based position |
| text | str | Chunk text |
| start_char / end_char | int | Offsets into extracted text |
| text_hash | str | Per-chunk 16-char hash, the seam for trails.vector |

Dedup runs on content_hash_of(extracted_text) — SHA-256 → urlsafe base64 → first 16 chars (96 bits of entropy). We hash the extracted text, not the raw file bytes: two PDFs with identical prose but different embedded metadata dedup correctly. ingest_file calls find_document_by_hash before every load, so a repeat returns the pre-existing IRI without touching the KG.
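The hash recipe is short enough to show in full. The sketch below follows the description above (SHA-256, urlsafe base64, first 16 characters); encoding the text as UTF-8 before hashing is an assumption, and the function name is deliberately not the library's.

```python
import base64
import hashlib


def content_hash_sketch(text: str) -> str:
    """SHA-256 of the text, urlsafe-base64 encoded, truncated to 16 chars."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Each base64 character carries 6 bits, so 16 characters keep 96 bits.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:16]
```

Because the input is the extracted text, any change to the extractor's output (whitespace normalisation, for instance) changes the hash even when the file on disk is unchanged.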

Pipeline

ingest_file(
    path,                       # str | Path — must exist on disk
    ctx,                        # trails.context.Context
    source=None,                # defaults to file:// URI
    mime=None,                  # defaults to detect_mime(path)
    title=None,
    chunker=paragraph_chunker,
) -> str  # Document IRI

ingest_directory(
    path,
    ctx,
    glob="*",                   # "*.pdf", "*.md", …
    recursive=True,
    chunker=paragraph_chunker,
    source_fn=None,             # Path -> str
) -> IngestReport

IngestReport acts like the list of Document IRIs (len, iteration) with report.failures: list[IngestFailure] and report.activity_iri: str | None attached. Per-file failures (missing extractor dep, extractor error, empty text) are captured and the walk continues — nothing aborts a batch.
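A rough sketch of the capture-and-continue shape, using stand-in classes. ReportSketch, FailureSketch, and walk_sketch are hypothetical names; only the field names (.documents, .failures, .activity_iri, .path, .error, .error_type) come from this page.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class FailureSketch:
    path: str
    error: str
    error_type: str


@dataclass
class ReportSketch:
    documents: list = field(default_factory=list)
    failures: list = field(default_factory=list)
    activity_iri: Optional[str] = None

    # len() and iteration delegate to the list of Document IRIs.
    def __len__(self) -> int:
        return len(self.documents)

    def __iter__(self) -> Iterator[str]:
        return iter(self.documents)


def walk_sketch(paths: Iterable[str], ingest_one: Callable[[str], str]) -> ReportSketch:
    """Ingest each path; record failures instead of aborting the batch."""
    report = ReportSketch()
    for path in paths:
        try:
            report.documents.append(ingest_one(path))
        except Exception as exc:
            report.failures.append(
                FailureSketch(str(path), str(exc), type(exc).__name__)
            )
    return report
```

A caller that needs a hard failure can check report.failures after the walk and raise on its own terms.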

PROV-O. Every ingest_file call emits one prov:Activity in the framework's provenance graph (https://trails.dev/ns/prov/) with prov:generated edges to the minted Document IRI and an https://trails.dev/ns/prov/activityKind literal of "trails.ingest.ingest_file". ingest_directory emits one activity per batch covering every successfully-ingested document; failures are not asserted (a failed extraction produced no entity). Per-chunk prov:Entity emission is deferred to Phase 2.

Chunk → vector store

After ingest, embed each Chunk's text and index the vector under the chunk's IRI. The vector store carries no KG awareness — the hybrid retriever relies on a single metadata["iri"] convention to bridge back. See the vector guide for the store + embedder surface.

from trails.ingest import Chunk, Document, ingest_file
from trails.vector import SqliteVecStore, SentenceTransformerEmbedder

doc_iri = ingest_file("paper.pdf", ctx)

embedder = SentenceTransformerEmbedder(model="all-MiniLM-L6-v2")
store = SqliteVecStore(path="vectors.db", dim=embedder.dim)

doc = Document.find(ctx, doc_iri)
for chunk in Chunk.where(document=doc).order_by("index").fetch(ctx):
    store.add(
        id=chunk.id,
        vector=embedder.embed(chunk.text),
        metadata={"iri": chunk.id, "doc_iri": doc.id, "snippet": chunk.text[:280]},
    )

Two details earn their place. chunk.text_hash is the seam for embedding dedup — a Phase 2 indexer can skip chunks whose hash already has a vector. And metadata["iri"] MUST be the chunk IRI for hybrid retrieval to work; metadata["snippet"] is picked up by RetrievalHit.snippet automatically.

Hybrid retrieval

Once chunks are indexed, pair a SPARQL narrow with the vector rerank:

from trails.vector import retrieve

hits = retrieve(
    "evidence the defendant was present",
    ctx=ctx,
    mode="hybrid",
    k=20,
    sparql_filter="""
        PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
        SELECT ?iri WHERE {
            ?iri <trails://app/Chunk/document> ?doc .
            ?doc <trails://app/Document/ingested_at> ?t .
            FILTER (?t >= "2024-01-01T00:00:00Z"^^xsd:dateTime)
        }
    """,
    vector_store=store,
    embedder=embedder,
)

retrieve runs the SPARQL to build a candidate IRI set, embeds the query, over-fetches from the store (4 * k), and keeps only hits whose metadata["iri"] is in the candidate set. Filter-then-rank is correct in both regimes — see vector.md for the tradeoff.
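The over-fetch-then-filter step can be illustrated with plain lists in place of a real vector store. Everything in this sketch is an assumption made for illustration: the hybrid_filter name, cosine similarity as the score, and truncating the surviving hits to k.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_filter(query_vec, indexed, candidate_iris, k, overfetch=4):
    """indexed is a list of (vector, metadata) pairs standing in for the store."""
    # Rank everything by similarity and over-fetch overfetch * k hits.
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    fetched = ranked[: overfetch * k]
    # Keep only hits whose metadata["iri"] passed the SPARQL narrow.
    hits = [meta for _, meta in fetched if meta.get("iri") in candidate_iris]
    return hits[:k]
```

The over-fetch factor matters because filtering happens after the vector search: if the SPARQL candidate set is small relative to the index, fewer than k of the fetched hits may survive.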

Anti-patterns

Skipping the chunker. Writing the full extracted text to a single Document.text field (or a one-item chunk list) defeats the point: the vector retriever can only score what the store indexes, and one 80 KB chunk produces one embedding. Always chunk; the default paragraph chunker is deterministic and free.

Re-ingesting without the hash check. Every ingest_file already calls find_document_by_hash — use it. Bypassing the pipeline (e.g., calling load_document directly on a file you've already ingested) doubles storage, duplicates prov:generated edges, and breaks downstream dedup assumptions.

Reference

Every public name from trails.ingest:

| Symbol | Shape |
| --- | --- |
| ingest_file | (path, ctx, *, source=None, mime=None, title=None, chunker=paragraph_chunker) -> str (Document IRI) |
| ingest_directory | (path, ctx, *, glob="*", recursive=True, chunker=paragraph_chunker, source_fn=None) -> IngestReport |
| IngestReport | .documents, .failures, .activity_iri; len + iter |
| IngestFailure | .path, .error, .error_type |
| Document | @node_type: uri, title, mime, content_hash, byte_size, ingested_at |
| Chunk | @node_type: document, index, text, start_char, end_char, text_hash |
| load_document | (ctx, text, *, uri, mime, chunks, title=None, ingested_at=None) -> str (pre-chunked fast path) |
| content_hash_of | (text) -> str (16-char) |
| find_document_by_hash | (ctx, content_hash) -> Document \| None |
| file_uri | (path) -> str (file://…) |
| RawChunk | Dataclass: index, text, start_char, end_char |
| paragraph_chunker | (text, *, min_chars=100, max_chars=1500) -> list[Chunk] |
| extract | (path_or_content, *, mime=None); dispatcher |
| extract_pdf, extract_html, extract_markdown, extract_docx, extract_text, extract_rtf | Per-format backends |
| detect_mime | (path) -> str |
| MissingExtractorDep | TrailsError subclass |

See also