Document ingestion

trails.ingest turns PDF, HTML, Markdown, DOCX, RTF, and plain-text documents into typed Document + Chunk nodes inside the kernel KG — with always-on PROV-O (ADR-0009) and hash-based dedup so re-ingesting a file is a no-op. It is the M10 Phase 1 slice of ADR-0019: the non-deferrable piece for any compliance-shaped app, and the upstream for trails.vector.

Quickstart

from trails.testing import fresh_context
from trails.ingest import ingest_file, ingest_directory

ctx = fresh_context()

# Single file — returns the minted Document IRI.
doc_iri = ingest_file("paper.pdf", ctx, source="arxiv:2404.12345")

# Directory walk (recursive by default).
report = ingest_directory("./corpus", ctx, glob="*.pdf")
print(f"Ingested {len(report)} documents, {len(report.failures)} failures")

# Re-ingesting the same file is idempotent — the content hash already exists.
same_iri = ingest_file("paper.pdf", ctx)
assert same_iri == doc_iri

Each ingest_file run extracts → chunks → loads → emits a prov:Activity. ingest_directory fans out per file, collects failures into an IngestReport, and emits one batch activity.

Extractors

Install the optional text backends with:

pip install 'trails[ingest]'

Each backend is lazy-imported. When a dep is missing the extractor raises MissingExtractorDep (a TrailsError subclass) with the exact pip install line — no silent fallbacks, no raw ImportError.
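The lazy-import behaviour can be sketched as follows. This is a minimal illustration, not the library's code: `require_backend` is a hypothetical helper, and `TrailsError` is redefined locally as a stand-in so the block runs on its own.

```python
import importlib


class TrailsError(Exception):
    """Local stand-in for the framework's base error (assumption)."""


class MissingExtractorDep(TrailsError):
    """Raised when an optional extractor backend is not installed."""

    def __init__(self, module: str, extra: str = "ingest") -> None:
        self.module = module
        self.install_hint = f"pip install 'trails[{extra}]'"
        super().__init__(
            f"extractor backend {module!r} is not installed; run: {self.install_hint}"
        )


def require_backend(module: str):
    """Import a backend on first use, translating ImportError (hypothetical helper)."""
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        # Surface the install line instead of a raw ImportError.
        raise MissingExtractorDep(module) from exc
```

An extractor written this way would call `require_backend("pypdf")` inside the function body rather than at module import time, so importing the package never fails just because an optional backend is absent.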

MIME	Extensions	Extractor	Dep
`application/pdf`	`.pdf`	`extract_pdf`	`pypdf`
`text/html`, `application/xhtml+xml`	`.html`, `.htm`, `.xhtml`	`extract_html`	`trafilatura`
`text/markdown`, `text/x-rst`	`.md`, `.markdown`, `.rst`	`extract_markdown`	`markdown-it-py`
`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	`.docx`	`extract_docx`	`python-docx`
`application/rtf`	`.rtf`	`extract_rtf`	stdlib
`text/plain`	`.txt`, `.text`	`extract_text`	stdlib

PDF extraction concatenates per-page text with blank lines between pages so the chunker still sees paragraph breaks; image-only pages are treated as empty and never raise.

HTML runs through trafilatura.extract, which strips navigation, ads, and chrome. When trafilatura returns None for a fragment, a naive tag-strip + whitespace-collapse fallback takes over.
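A naive fallback of this kind might look like the sketch below. It uses only the standard library; `naive_html_to_text` is a hypothetical name, and dropping script/style bodies before stripping tags is an assumption about what "naive" covers.

```python
import html
import re

_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def naive_html_to_text(fragment: str) -> str:
    """Strip tags and collapse whitespace; no boilerplate detection at all."""
    without_code = _SCRIPT_STYLE.sub(" ", fragment)   # drop script/style bodies
    without_tags = _TAG.sub(" ", without_code)        # replace each tag with a space
    unescaped = html.unescape(without_tags)           # &amp; -> &, &lt; -> <, ...
    return _WHITESPACE.sub(" ", unescaped).strip()
```

A caller could then write something like `text = trafilatura.extract(raw) or naive_html_to_text(raw)`, so the fallback only runs when the primary extractor yields nothing.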

Markdown strips a leading YAML front-matter block, then returns the source verbatim — Markdown doubles as plain text for paragraph chunking. markdown-it-py is still called (MarkdownIt().parse(text)) purely as a validation gate so catastrophically malformed input fails fast rather than producing nonsense chunks.
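Front-matter stripping can be approximated with a single anchored regular expression. The sketch below is an assumption about the shape of that step (`strip_front_matter` is a hypothetical name); it accepts either `---` or `...` as the closing line, which is what YAML allows.

```python
import re

# Opening '---' on the very first line, an optional body, then a closing
# '---' or '...' on a line of its own.
_FRONT_MATTER = re.compile(
    r"\A---[ \t]*\r?\n(?:.*?\r?\n)?(?:---|\.\.\.)[ \t]*(?:\r?\n|\Z)",
    re.DOTALL,
)


def strip_front_matter(source: str) -> str:
    """Remove one leading YAML front-matter block; leave everything else verbatim."""
    return _FRONT_MATTER.sub("", source, count=1)
```

Because the pattern is anchored with `\A`, a `---` horizontal rule further down the document is left alone.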

Plain text (extract_text) probes the first 8 KB as strict UTF-8. On a clean probe it decodes the full payload strictly; if that still fails (non-UTF-8 bytes past the probe window), it logs a second warning and falls back to latin-1. Leading UTF-8 BOMs are stripped and \r\n / lone \r line endings are normalised to \n.
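The decode path can be sketched with the standard library alone. Two things here are assumptions rather than documented behaviour: a failed probe takes the same latin-1 fallback as a failed full decode, and the BOM is stripped at the byte level before decoding. `decode_text` and the logger name are hypothetical.

```python
import codecs
import logging

logger = logging.getLogger("ingest_sketch")
PROBE_BYTES = 8 * 1024


def decode_text(payload: bytes) -> str:
    """Decode as strict UTF-8 when possible, else latin-1, then normalise."""
    if payload.startswith(codecs.BOM_UTF8):
        payload = payload[len(codecs.BOM_UTF8):]
    try:
        # final=False tolerates a multi-byte character cut by the probe window.
        probe = payload[:PROBE_BYTES]
        codecs.getincrementaldecoder("utf-8")("strict").decode(probe, final=False)
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: probe failure and full-decode failure share this fallback.
        logger.warning("payload is not valid UTF-8; falling back to latin-1")
        text = payload.decode("latin-1")
    # Normalise CRLF first, then any remaining lone CR.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

latin-1 is a safe last resort because every byte value maps to a code point, so the fallback itself cannot raise.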

DOCX (landed in commit 44f6584) uses python-docx. Body paragraphs are kept with internal whitespace collapsed. Tables are rendered row-wise with " | " between cells and "\n" between rows; each table becomes one block. Paragraphs whose style name starts with Header or Footer (case-insensitive, covers localised variants) are dropped. Images, charts, and OLE objects are silently skipped — OCR is an ADR-0019 non-goal. Paragraphs and tables are joined with \n\n so the chunker sees the boundaries.
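The text-shaping rules above can be illustrated on plain data, without python-docx. All function names below are hypothetical, and the inputs (lists of cell strings, style names as strings) are stand-ins for whatever the real extractor reads from the document.

```python
import re

_WS = re.compile(r"\s+")


def collapse(text: str) -> str:
    """Collapse internal whitespace runs to single spaces."""
    return _WS.sub(" ", text).strip()


def render_table(rows: list[list[str]]) -> str:
    """Render a table row-wise: ' | ' between cells, newline between rows."""
    return "\n".join(" | ".join(collapse(cell) for cell in row) for row in rows)


def is_header_or_footer(style_name: str) -> bool:
    """Case-insensitive prefix check on the paragraph style name."""
    return style_name.lower().startswith(("header", "footer"))


def join_blocks(blocks: list[str]) -> str:
    """Join paragraph and table blocks with a blank line so chunking sees them."""
    return "\n\n".join(block for block in blocks if block)
```

For example, `join_blocks(["Intro  text", render_table([["a", "b"], ["c", "d"]])])` yields one paragraph block followed by one two-line table block.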

RTF is a stdlib-only stripper for the "someone emailed me an RTF with plain paragraphs" case. It walks {\dest ...} group nesting to drop hidden destinations (fonttbl, stylesheet, info, pict, object, and siblings), translates \par / \line / \tab, decodes \uNNNN as signed 16-bit, strips remaining control words, and collapses braces. For rich RTF install a dedicated parser and pass the extracted text through ingest_file(..., mime="text/plain").
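The signed 16-bit decoding of `\uNNNN` is the least obvious step, so here is a minimal sketch of that step alone. `decode_rtf_unicode` is a hypothetical name; skipping exactly one fallback character after each escape follows the RTF default (`\uc1`) and is an assumption about what the stripper does. Surrogate pairs are not recombined here.

```python
import re

# \uN holds a signed 16-bit decimal, optionally followed by a delimiter space
# and one fallback character (a literal character or a \'hh hex escape).
_UNICODE_ESCAPE = re.compile(r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?")


def decode_rtf_unicode(rtf: str) -> str:
    """Replace \\uN escapes (and their fallback character) with the real character."""

    def _replace(match: re.Match) -> str:
        value = int(match.group(1))
        if value < 0:
            value += 0x10000  # map the signed value back into 0..65535
        return chr(value)

    return _UNICODE_ESCAPE.sub(_replace, rtf)
```

Negative values appear because RTF writers store code points above 32767 as signed shorts; adding 65536 recovers the intended code point.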

Chunker

from trails.ingest import paragraph_chunker

chunks = paragraph_chunker(text, min_chars=100, max_chars=1500)

The default chunker is deterministic and NLP-free. Algorithm:

1. Split on runs of two+ newlines (paragraph break).
2. Merge paragraphs shorter than min_chars into the next paragraph (stray headings join their body); a trailing short paragraph glues to the previous chunk.
3. Cap paragraphs longer than max_chars by splitting at sentence boundaries ((?<=[.!?])\s+), hard-slicing when a single sentence still overshoots.
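The three steps above can be sketched as a self-contained function that works on character spans, so the offsets fall out for free. This is an illustration under assumptions, not the library's implementation: `SketchChunk` and `sketch_paragraph_chunker` are hypothetical names, and details such as greedy sentence packing are guesses.

```python
import re
from dataclasses import dataclass

_PARA_BREAK = re.compile(r"\n{2,}")
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


@dataclass(frozen=True)
class SketchChunk:  # stand-in type, not the library's Chunk
    index: int
    text: str
    start_char: int
    end_char: int


def _paragraph_spans(text):
    """Step 1: (start, end) offsets of each non-blank paragraph."""
    spans, pos = [], 0
    for match in _PARA_BREAK.finditer(text):
        if text[pos:match.start()].strip():
            spans.append((pos, match.start()))
        pos = match.end()
    if text[pos:].strip():
        spans.append((pos, len(text)))
    return spans


def _merge_short(spans, min_chars):
    """Step 2: fold short paragraphs forward; glue a short tail backward."""
    merged, carry = [], None
    for start, end in spans:
        if carry is not None:
            start = carry
        if end - start < min_chars:
            carry = start  # wait for the next paragraph
        else:
            merged.append((start, end))
            carry = None
    if carry is not None:  # trailing short paragraph
        tail_end = spans[-1][1]
        if merged:
            merged[-1] = (merged[-1][0], tail_end)
        else:
            merged.append((carry, tail_end))
    return merged


def _cap(text, start, end, max_chars):
    """Step 3: split an over-long span at sentence breaks, hard-slicing if needed."""
    if end - start <= max_chars:
        return [(start, end)]
    sentences, pos = [], start
    for match in _SENTENCE_BREAK.finditer(text, start, end):
        sentences.append((pos, match.start()))
        pos = match.end()
    if pos < end:
        sentences.append((pos, end))
    pieces, cur = [], None
    for s, e in sentences:
        if cur is not None and e - cur[0] <= max_chars:
            cur = (cur[0], e)  # pack another sentence into the current piece
            continue
        if cur is not None:
            pieces.append(cur)
        while e - s > max_chars:  # a single sentence still overshoots
            pieces.append((s, s + max_chars))
            s += max_chars
        cur = (s, e)
    if cur is not None:
        pieces.append(cur)
    return pieces


def sketch_paragraph_chunker(text, min_chars=100, max_chars=1500):
    spans = _merge_short(_paragraph_spans(text), min_chars)
    pieces = [p for s, e in spans for p in _cap(text, s, e, max_chars)]
    return [SketchChunk(i, text[s:e], s, e) for i, (s, e) in enumerate(pieces)]
```

Working on spans rather than strings means every chunk satisfies `text[c.start_char:c.end_char] == c.text`, which is the property highlight-on-hover depends on.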

Each Chunk carries start_char / end_char offsets into the source text, used for highlight-on-hover and per-chunk provenance. Plug in a different chunker by passing chunker=my_fn to ingest_file / ingest_directory; the protocol is a bare Callable[[str], list[Chunk]], no base class.

Loader + node types

Document and Chunk are ordinary @node_type classes (see the ORM guide), so every capability that reads / writes them goes through the same ORM surface as user code.

from trails.ingest import Document, Chunk

doc = Document.find(ctx, doc_iri)
chunks = Chunk.where(document=doc).order_by("index").fetch(ctx)

Field (Document)	Type	Notes
`uri`	`str`	Source URI (`file://`, `arxiv:`, `https://`, …)
`title`	`str | None`	Not auto-extracted in Phase 1
`mime`	`str`	Detected MIME
`content_hash`	`str`	16-char base64 SHA-256 of extracted text
`byte_size`	`int`	UTF-8 byte length of extracted text
`ingested_at`	`datetime`	UTC timestamp

Field (Chunk)	Type	Notes
`document`	`Document` (ref)	Back-reference, stored as IRI triple
`index`	`int`	0-based position
`text`	`str`	Chunk text
`start_char` / `end_char`	`int`	Offsets into extracted text
`text_hash`	`str`	Per-chunk 16-char hash — the seam for `trails.vector`

Dedup runs on content_hash_of(extracted_text) — SHA-256 → urlsafe base64 → first 16 chars (96 bits of entropy). We hash the extracted text, not the raw file bytes: two PDFs with identical prose but different embedded metadata dedup correctly. ingest_file calls find_document_by_hash before every load, so a repeat returns the pre-existing IRI without touching the KG.
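The hashing recipe is short enough to write out. The sketch below follows the description above (SHA-256, urlsafe base64, first 16 characters); `sketch_content_hash` is a hypothetical name, and encoding the text as UTF-8 before hashing is an assumption.

```python
import base64
import hashlib


def sketch_content_hash(text: str) -> str:
    """SHA-256 of the extracted text, urlsafe-base64 encoded, truncated to 16 chars."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # assumption: UTF-8
    return base64.urlsafe_b64encode(digest).decode("ascii")[:16]
```

Each base64 character carries 6 bits, so 16 characters keep 96 bits of the digest, and the truncation never reaches the `=` padding at the end of the 44-character encoding.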

Pipeline

ingest_file(
    path,                       # str | Path — must exist on disk
    ctx,                        # trails.context.Context
    source=None,                # defaults to file:// URI
    mime=None,                  # defaults to detect_mime(path)
    title=None,
    chunker=paragraph_chunker,
)

ingest_directory(
    path,
    ctx,
    glob="*",                   # "*.pdf", "*.md", …
    recursive=True,
    chunker=paragraph_chunker,
    source_fn=None,             # Path -> str
) -> IngestReport

IngestReport acts like the list of Document IRIs (len, iteration) with report.failures: list[IngestFailure] and report.activity_iri: str | None attached. Per-file failures (missing extractor dep, extractor error, empty text) are captured and the walk continues — nothing aborts a batch.
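Since a batch never aborts, the failures list is where problems surface. The helper below is a hypothetical triage sketch that relies only on the `.failures`, `.path`, and `.error_type` attributes listed in this guide; it assumes `error_type` is a string (or renders usefully via `str`), and the demo report, including the `EmptyText` label, is invented stand-in data.

```python
from types import SimpleNamespace


def failures_by_type(report) -> dict[str, list[str]]:
    """Group failed paths by error type so a batch can be triaged at a glance."""
    grouped: dict[str, list[str]] = {}
    for failure in report.failures:
        grouped.setdefault(str(failure.error_type), []).append(str(failure.path))
    return grouped


# Stand-in report for illustration only; a real one comes from ingest_directory.
demo_report = SimpleNamespace(
    failures=[
        SimpleNamespace(path="corpus/a.pdf", error="pypdf missing", error_type="MissingExtractorDep"),
        SimpleNamespace(path="corpus/b.pdf", error="no text", error_type="EmptyText"),
        SimpleNamespace(path="corpus/c.pdf", error="pypdf missing", error_type="MissingExtractorDep"),
    ],
    activity_iri=None,
)
print(failures_by_type(demo_report))
```

A group dominated by MissingExtractorDep usually means one `pip install` fixes the whole batch, after which re-running the walk is cheap because already-ingested files dedup on their content hash.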

PROV-O. Every ingest_file call emits one prov:Activity in the framework's provenance graph (https://trails.dev/ns/prov/) with prov:generated edges to the minted Document IRI and an https://trails.dev/ns/prov/activityKind literal of "trails.ingest.ingest_file". ingest_directory emits one activity per batch covering every successfully-ingested document; failures are not asserted (a failed extraction produced no entity). Per-chunk prov:Entity emission is deferred to Phase 2.

Chunk → vector store

After ingest, embed each Chunk's text and index the vector under the chunk's IRI. The vector store carries no KG awareness — the hybrid retriever relies on a single metadata["iri"] convention to bridge back. See the vector guide for the store + embedder surface.

from trails.ingest import Chunk, Document, ingest_file
from trails.vector import SqliteVecStore, SentenceTransformerEmbedder

doc_iri = ingest_file("paper.pdf", ctx)

embedder = SentenceTransformerEmbedder(model="all-MiniLM-L6-v2")
store = SqliteVecStore(path="vectors.db", dim=embedder.dim)

doc = Document.find(ctx, doc_iri)
for chunk in Chunk.where(document=doc).order_by("index").fetch(ctx):
    store.add(
        id=chunk.id,
        vector=embedder.embed(chunk.text),
        metadata={"iri": chunk.id, "doc_iri": doc.id, "snippet": chunk.text[:280]},
    )

Two details matter here. First, chunk.text_hash is the seam for embedding dedup: a Phase 2 indexer can skip chunks whose hash already has a vector. Second, metadata["iri"] MUST be the chunk IRI for hybrid retrieval to work; metadata["snippet"] is picked up by RetrievalHit.snippet automatically.
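One way to use the text_hash seam is an embedding cache keyed by hash, so identical chunk text is embedded once while every chunk IRI still gets its own store entry. This is a sketch of that idea, not the Phase 2 indexer: `embed_with_cache` is a hypothetical helper, and the chunks and embedder below are stand-ins.

```python
from types import SimpleNamespace


def embed_with_cache(chunks, embed, cache):
    """Return (chunk, vector) pairs, calling embed() once per distinct text_hash."""
    pairs = []
    for chunk in chunks:
        if chunk.text_hash not in cache:
            cache[chunk.text_hash] = embed(chunk.text)
        pairs.append((chunk, cache[chunk.text_hash]))
    return pairs


# Stand-in chunks and a fake embedder that records how often it is called.
calls = []


def fake_embed(text):
    calls.append(text)
    return [float(len(text))]


demo_chunks = [
    SimpleNamespace(text="same", text_hash="h1"),
    SimpleNamespace(text="same", text_hash="h1"),
    SimpleNamespace(text="other!", text_hash="h2"),
]
pairs = embed_with_cache(demo_chunks, fake_embed, cache={})
```

Persisting the cache between runs would let a later ingest of overlapping documents avoid repeat embedding calls.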

Hybrid retrieval

Once chunks are indexed, pair a SPARQL narrow with the vector rerank:

from trails.vector import retrieve

hits = retrieve(
    "evidence the defendant was present",
    ctx=ctx,
    mode="hybrid",
    k=20,
    sparql_filter="""
        SELECT ?iri WHERE {
            ?iri <trails://app/Chunk/document> ?doc .
            ?doc <trails://app/Document/ingested_at> ?t .
            FILTER (?t >= "2024-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
        }
    """,
    vector_store=store,
    embedder=embedder,
)

retrieve runs the SPARQL to build a candidate IRI set, embeds the query, over-fetches from the store (4 * k), and keeps only hits whose metadata["iri"] is in the candidate set. Filter-then-rank is correct in both regimes — see vector.md for the tradeoff.
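The over-fetch-and-filter step can be sketched on plain data. `hybrid_select` is a hypothetical function and the hit dictionaries are an assumed shape; only the `metadata["iri"]` convention and the 4 * k window come from the description above.

```python
def hybrid_select(ranked_hits, candidate_iris, k, overfetch=4):
    """Keep up to k hits from the over-fetched window whose IRI is a candidate."""
    allowed = set(candidate_iris)
    window = ranked_hits[: overfetch * k]  # what a store returns for limit=4*k
    kept = [hit for hit in window if hit["metadata"]["iri"] in allowed]
    return kept[:k]


# Hypothetical hits, already sorted best-first as a vector store would return them.
hits = [{"score": 1.0 - i / 10, "metadata": {"iri": f"chunk:{i}"}} for i in range(10)]
top = hybrid_select(hits, {"chunk:1", "chunk:3", "chunk:9"}, k=2)
```

In this sketch a candidate ranked beyond the over-fetch window (here `chunk:9`, outside the first 8 hits) is never returned, which is why the size of the window matters when the SPARQL filter is very selective.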

Anti-patterns

Skipping the chunker. Writing the full extracted text to a single Document.text field (or a one-item chunk list) defeats the point: the vector retriever can only score what the store indexes, and one 80 KB chunk produces one embedding. Always chunk; the default paragraph chunker is deterministic and free.

Re-ingesting without the hash check. Every ingest_file already calls find_document_by_hash — use it. Bypassing the pipeline (e.g., calling load_document directly on a file you've already ingested) doubles storage, duplicates prov:generated edges, and breaks downstream dedup assumptions.

Reference

Every public name from trails.ingest:

Symbol	Shape
`ingest_file(path, ctx, *, source=None, mime=None, title=None, chunker=paragraph_chunker)`	`-> str` (Document IRI)
`ingest_directory(path, ctx, *, glob="*", recursive=True, chunker=paragraph_chunker, source_fn=None)`	`-> IngestReport`
`IngestReport`	`.documents`, `.failures`, `.activity_iri`; `len` + iter
`IngestFailure`	`.path`, `.error`, `.error_type`
`Document`	`@node_type` — `uri`, `title`, `mime`, `content_hash`, `byte_size`, `ingested_at`
`Chunk`	`@node_type` — `document`, `index`, `text`, `start_char`, `end_char`, `text_hash`
`load_document(ctx, text, *, uri, mime, chunks, title=None, ingested_at=None)`	`-> str` (pre-chunked fast path)
`content_hash_of(text)`	`-> str` (16-char)
`find_document_by_hash(ctx, content_hash)`	`-> Document | None`
`file_uri(path)`	`-> str` (`file://…`)
`RawChunk`	Dataclass — `index`, `text`, `start_char`, `end_char`
`paragraph_chunker(text, *, min_chars=100, max_chars=1500)`	`-> list[Chunk]`
`extract(path_or_content, *, mime=None)`	Dispatcher
`extract_pdf`, `extract_html`, `extract_markdown`, `extract_docx`, `extract_text`, `extract_rtf`	Per-format backends
`detect_mime(path)`	`-> str`
`MissingExtractorDep`	`TrailsError` subclass