Document ingestion
trails.ingest turns PDF, HTML, Markdown, DOCX, RTF, and plain-text
documents into typed Document + Chunk nodes inside the kernel KG, with
always-on PROV-O (ADR-0009) and hash-based dedup so re-ingesting a file
is a no-op. It is the M10 Phase 1 slice of ADR-0019: the non-deferrable
piece for any compliance-shaped app, and the upstream for trails.vector.
Quickstart
from trails.testing import fresh_context
from trails.ingest import ingest_file, ingest_directory
ctx = fresh_context()
# Single file — returns the minted Document IRI.
doc_iri = ingest_file("paper.pdf", ctx, source="arxiv:2404.12345")
# Directory walk (recursive by default).
report = ingest_directory("./corpus", ctx, glob="*.pdf")
print(f"Ingested {len(report)} documents, {len(report.failures)} failures")
# Re-ingesting the same file is idempotent — the content hash already exists.
same_iri = ingest_file("paper.pdf", ctx)
assert same_iri == doc_iri
Each ingest_file run extracts → chunks → loads → emits a
prov:Activity. ingest_directory fans out per file, collects
failures into an IngestReport, and emits one batch activity.
Extractors
Install the optional text backends with:
Each backend is lazy-imported. When a dep is missing the extractor
raises MissingExtractorDep (a TrailsError subclass) with the exact
pip install line — no silent fallbacks, no raw ImportError.
| MIME | Extensions | Extractor | Dep |
|---|---|---|---|
| application/pdf | .pdf | extract_pdf | pypdf |
| text/html, application/xhtml+xml | .html, .htm, .xhtml | extract_html | trafilatura |
| text/markdown, text/x-rst | .md, .markdown, .rst | extract_markdown | markdown-it-py |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | .docx | extract_docx | python-docx |
| application/rtf | .rtf | extract_rtf | stdlib |
| text/plain | .txt, .text | extract_text | stdlib |
PDF concatenates per-page text with blank lines between pages so the
chunker still sees paragraph breaks; image-only pages are treated as
empty and never raise.
HTML runs through trafilatura.extract which strips navigation,
ads, and chrome. A naive tag-strip + whitespace-collapse fallback kicks
in for fragments trafilatura returns None for.
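A fallback of the kind described here might look like the sketch below. The function name `naive_html_to_text` and the entity-unescaping step are assumptions for illustration; the guide only states that tags are stripped and whitespace collapsed.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def naive_html_to_text(fragment: str) -> str:
    """Strip tags, unescape entities, and collapse whitespace."""
    no_tags = _TAG_RE.sub(" ", fragment)   # replace each tag with a space
    unescaped = html.unescape(no_tags)     # &amp; -> &, &lt; -> <, ...
    return _WS_RE.sub(" ", unescaped).strip()
```

With trafilatura installed, the two paths could be combined as `trafilatura.extract(page) or naive_html_to_text(page)`, since `trafilatura.extract` returns None when it finds no main content.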
Markdown strips a leading YAML front-matter block, then returns the
source verbatim — Markdown doubles as plain text for paragraph
chunking. markdown-it-py is still called (MarkdownIt().parse(text))
purely as a validation gate so catastrophically malformed input fails
fast rather than producing nonsense chunks.
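As a rough sketch of the front-matter step, assuming the common convention of a block opened by `---` on the first line and closed by a line containing only `---` or `...`. The regex and the name `strip_front_matter` are illustrative, not taken from the library.

```python
import re

# '---' on the very first line, then everything up to the next line that is
# exactly '---' or '...'. DOTALL lets '.' span newlines; MULTILINE lets '^'
# match at the start of the closing line.
_FRONT_MATTER_RE = re.compile(
    r"\A---[ \t]*\r?\n.*?^(?:---|\.\.\.)[ \t]*\r?\n?",
    re.DOTALL | re.MULTILINE,
)


def strip_front_matter(source: str) -> str:
    """Remove one leading YAML front-matter block, if present."""
    return _FRONT_MATTER_RE.sub("", source, count=1)
```

The anchor `\A` ensures that a `---` thematic break further down the document is never mistaken for front matter.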
Plain text (extract_text) probes the first 8 KB as strict
UTF-8. On a clean probe it decodes the full payload strictly; if that
still fails (non-UTF-8 bytes past the probe window), it logs a second
warning and falls back to latin-1. Leading UTF-8 BOMs are stripped
and \r\n / lone \r line endings are normalised to \n.
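The decoding strategy can be sketched as below. Two points are assumptions rather than documented behaviour: that a failed probe also logs a warning and falls back to latin-1, and that the probe tolerates a multi-byte character cut in half at the 8 KB boundary (done here with an incremental decoder). The names `decode_text` and `_probe_is_utf8` are hypothetical.

```python
import codecs
import logging

log = logging.getLogger("ingest_sketch")
PROBE_BYTES = 8 * 1024


def _probe_is_utf8(payload: bytes) -> bool:
    # final=False accepts an incomplete multi-byte sequence at the end of
    # the probe window instead of treating it as an error.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(payload[:PROBE_BYTES], final=False)
        return True
    except UnicodeDecodeError:
        return False


def decode_text(payload: bytes) -> str:
    text = None
    if _probe_is_utf8(payload):
        try:
            text = payload.decode("utf-8")  # strict decode of the full payload
        except UnicodeDecodeError:
            log.warning("non-UTF-8 bytes past the probe window; using latin-1")
    else:
        # Assumed branch: the guide does not spell out the failed-probe path.
        log.warning("probe is not valid UTF-8; using latin-1")
    if text is None:
        text = payload.decode("latin-1")  # latin-1 maps every byte, never fails
    if text.startswith("\ufeff"):
        text = text[1:]  # drop a leading BOM
    # Normalise CRLF first so a lone-CR pass does not double the newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```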
DOCX (landed in commit 44f6584) uses python-docx. Body
paragraphs are kept with internal whitespace collapsed. Tables are
rendered row-wise with " | " between cells and "\n" between rows;
each table becomes one block. Paragraphs whose style name starts with
Header or Footer (case-insensitive, covers localised variants) are
dropped. Images, charts, and OLE objects are silently skipped — OCR is
an ADR-0019 non-goal. Paragraphs and tables are joined with \n\n so
the chunker sees the boundaries.
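The text-shaping rules above can be illustrated without python-docx by working on plain strings. The helper names below are hypothetical; the sketch assumes paragraph text, style names, and table cell text have already been read from the document.

```python
import re


def collapse_ws(paragraph: str) -> str:
    """Collapse internal whitespace runs in a body paragraph."""
    return re.sub(r"\s+", " ", paragraph).strip()


def is_header_or_footer(style_name: str) -> bool:
    """Case-insensitive prefix test on the paragraph style name."""
    return style_name.casefold().startswith(("header", "footer"))


def render_table(rows: list[list[str]]) -> str:
    """One block per table: cells joined by ' | ', rows by a newline."""
    return "\n".join(" | ".join(row) for row in rows)


def join_blocks(blocks: list[str]) -> str:
    """Blank line between blocks so a paragraph chunker sees boundaries."""
    return "\n\n".join(block for block in blocks if block)
```

In python-docx the inputs would come from `paragraph.text`, `paragraph.style.name`, and `cell.text` for each cell in `table.rows`.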
RTF is a stdlib-only stripper for the "someone emailed me an RTF
with plain paragraphs" case. It walks {\dest ...} group nesting to
drop hidden destinations (fonttbl, stylesheet, info, pict,
object, and siblings), translates \par / \line / \tab, decodes
\uNNNN as signed 16-bit, strips remaining control words, and
collapses braces. For rich RTF install a dedicated parser and pass the
extracted text through ingest_file(..., mime="text/plain").
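The signed 16-bit detail is the easiest part of RTF to get wrong, so here is a minimal sketch of that step alone. It assumes the default `\uc1` setting, where each `\uN` is followed by exactly one fallback character (a plain character or a `\'hh` escape) that should be skipped; the function name is hypothetical and the rest of the stripper is not shown.

```python
import re

# \uN, an optional delimiting space, then one fallback character to skip.
_U_RE = re.compile(r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?")


def decode_rtf_unicode(rtf: str) -> str:
    def _sub(match: re.Match) -> str:
        # RTF writes code units as signed 16-bit integers, so values above
        # 32767 appear negative; masking recovers the unsigned code unit.
        return chr(int(match.group(1)) & 0xFFFF)

    text = _U_RE.sub(_sub, rtf)
    # Adjacent high/low surrogates become one code point after a UTF-16
    # round trip; an unpaired surrogate is replaced rather than raising.
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
```

Characters outside the Basic Multilingual Plane arrive as two consecutive `\uN` words (a surrogate pair), which is why the round trip through UTF-16 is needed.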
Chunker
from trails.ingest import paragraph_chunker
chunks = paragraph_chunker(text, min_chars=100, max_chars=1500)
The default chunker is deterministic and NLP-free. Algorithm:
- Split on runs of two or more newlines (paragraph break).
- Merge paragraphs shorter than min_chars into the next paragraph (stray headings join their body); a trailing short paragraph glues to the previous chunk.
- Cap paragraphs longer than max_chars by splitting at sentence boundaries ((?<=[.!?])\s+), hard-slicing when a single sentence still overshoots.
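The three steps can be sketched as follows. This is one possible reading of the algorithm, not the shipped implementation: `ChunkSpan` is a hypothetical stand-in for the library's chunk record, sentences are packed greedily up to max_chars, and every chunk is a contiguous slice of the input so that the offsets stay exact.

```python
import re
from dataclasses import dataclass

_PARA_BREAK = re.compile(r"\n{2,}")
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


@dataclass(frozen=True)
class ChunkSpan:
    """Hypothetical stand-in for the library's chunk record."""
    index: int
    text: str
    start_char: int
    end_char: int


def _paragraph_spans(text):
    # Step 1: (start, end) offsets of the text between paragraph breaks.
    spans, pos = [], 0
    for m in _PARA_BREAK.finditer(text):
        if m.start() > pos:
            spans.append((pos, m.start()))
        pos = m.end()
    if pos < len(text):
        spans.append((pos, len(text)))
    return spans


def _merge_short(spans, min_chars):
    # Step 2: a short paragraph is carried forward into the next one.
    merged, pending = [], None
    for start, end in spans:
        if pending is not None:
            start, pending = pending[0], None
        if end - start < min_chars:
            pending = (start, end)
        else:
            merged.append((start, end))
    if pending is not None:  # trailing short paragraph glues backwards
        if merged:
            merged[-1] = (merged[-1][0], pending[1])
        else:
            merged.append(pending)
    return merged


def _cap(text, start, end, max_chars):
    # Step 3: pack whole sentences greedily; hard-slice oversized ones.
    if end - start <= max_chars:
        return [(start, end)]
    bounds = [m.end() for m in _SENTENCE_END.finditer(text, start, end)] + [end]
    out, piece_start, prev = [], start, start
    for b in bounds:
        if b - piece_start > max_chars and prev > piece_start:
            out.append((piece_start, prev))
            piece_start = prev
        while b - piece_start > max_chars:
            out.append((piece_start, piece_start + max_chars))
            piece_start += max_chars
        prev = b
    if piece_start < end:
        out.append((piece_start, end))
    return out


def paragraph_chunker_sketch(text, min_chars=100, max_chars=1500):
    chunks = []
    for start, end in _merge_short(_paragraph_spans(text), min_chars):
        for s, e in _cap(text, start, end, max_chars):
            while s < e and text[s].isspace():      # trim leading whitespace
                s += 1
            while e > s and text[e - 1].isspace():  # trim trailing whitespace
                e -= 1
            if s < e:
                chunks.append(ChunkSpan(len(chunks), text[s:e], s, e))
    return chunks
```

Working on offsets rather than on split strings keeps the invariant `text[c.start_char:c.end_char] == c.text` for every chunk.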
Each Chunk carries start_char / end_char offsets into the source
text for highlight-on-hover and per-chunk provenance uses. Plug in a
different chunker by passing chunker=my_fn to ingest_file /
ingest_directory — the protocol is a bare
Callable[[str], list[Chunk]], no base class.
Loader + node types
Document and Chunk are ordinary @node_type classes (see the
ORM guide), so every capability that reads / writes them
goes through the same ORM surface as user code.
from trails.ingest import Document, Chunk
doc = Document.find(ctx, doc_iri)
chunks = Chunk.where(document=doc).order_by("index").fetch(ctx)
| Field (Document) | Type | Notes |
|---|---|---|
| uri | str | Source URI (file://, arxiv:, https://, …) |
| title | str \| None | Not auto-extracted in Phase 1 |
| mime | str | Detected MIME |
| content_hash | str | 16-char base64 SHA-256 of extracted text |
| byte_size | int | UTF-8 byte length of extracted text |
| ingested_at | datetime | UTC timestamp |
| Field (Chunk) | Type | Notes |
|---|---|---|
| document | Document (ref) | Back-reference, stored as IRI triple |
| index | int | 0-based position |
| text | str | Chunk text |
| start_char / end_char | int | Offsets into extracted text |
| text_hash | str | Per-chunk 16-char hash; the seam for trails.vector |
Dedup runs on content_hash_of(extracted_text) — SHA-256 → urlsafe
base64 → first 16 chars (96 bits of entropy). We hash the extracted
text, not the raw file bytes: two PDFs with identical prose but
different embedded metadata dedup correctly. ingest_file calls
find_document_by_hash before every load, so a repeat returns the
pre-existing IRI without touching the KG.
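The hash recipe is short enough to sketch directly. The function name is hypothetical, and encoding the text as UTF-8 before hashing is an assumption; the guide specifies only SHA-256, urlsafe base64, and the 16-character truncation.

```python
import base64
import hashlib


def content_hash_sketch(text: str) -> str:
    """SHA-256 -> urlsafe base64 -> first 16 characters."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 raw bytes
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    # Each base64 character carries 6 bits, so 16 characters keep 96 bits.
    return encoded[:16]
```

Truncating after encoding, rather than before, is what gives the 16 x 6 = 96 bits quoted above.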
Pipeline
ingest_file(
    path,                       # str | Path; must exist on disk
    ctx,                        # trails.context.Context
    source=None,                # defaults to file:// URI
    mime=None,                  # defaults to detect_mime(path)
    title=None,
    chunker=paragraph_chunker,
) -> str                        # Document IRI
ingest_directory(
    path,
    ctx,
    glob="*",                   # "*.pdf", "*.md", …
    recursive=True,
    chunker=paragraph_chunker,
    source_fn=None,             # Path -> str
) -> IngestReport
IngestReport acts like the list of Document IRIs (len, iteration)
with report.failures: list[IngestFailure] and
report.activity_iri: str | None attached. Per-file failures (missing
extractor dep, extractor error, empty text) are captured and the walk
continues — nothing aborts a batch.
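The list-like report and the capture-and-continue loop might be shaped roughly like this. The class and function names carry a `Sketch` suffix or are otherwise hypothetical; the attribute names mirror the Reference table, while the field types and the `ingest_many` driver are assumptions.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class IngestFailureSketch:
    path: str
    error: str
    error_type: str


@dataclass
class IngestReportSketch:
    documents: list[str] = field(default_factory=list)
    failures: list[IngestFailureSketch] = field(default_factory=list)
    activity_iri: Optional[str] = None

    def __len__(self) -> int:
        return len(self.documents)   # len(report) counts successes only

    def __iter__(self) -> Iterator[str]:
        return iter(self.documents)  # iteration yields Document IRIs


def ingest_many(
    paths: Iterable[str], ingest_one: Callable[[Path], str]
) -> IngestReportSketch:
    report = IngestReportSketch()
    for path in paths:
        try:
            report.documents.append(ingest_one(Path(path)))
        except Exception as exc:
            # Record the failure and keep walking; one bad file never
            # aborts the batch.
            report.failures.append(
                IngestFailureSketch(str(path), str(exc), type(exc).__name__)
            )
    return report
```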
PROV-O. Every ingest_file call emits one prov:Activity in the
framework's provenance graph (https://trails.dev/ns/prov/) with
prov:generated edges to the minted Document IRI and an
https://trails.dev/ns/prov/activityKind literal of
"trails.ingest.ingest_file". ingest_directory emits one activity
per batch covering every successfully-ingested document; failures
are not asserted (a failed extraction produced no entity). Per-chunk
prov:Entity emission is deferred to Phase 2.
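To make the shape of that activity concrete, the sketch below builds the triples as plain tuples. The helper name and the example IRIs are hypothetical; `prov:Activity` and `prov:generated` are terms from the W3C PROV-O namespace, and the activityKind predicate is the one quoted above.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ACTIVITY_KIND = "https://trails.dev/ns/prov/activityKind"


def ingest_activity_triples(activity_iri, doc_iris, kind):
    """Return (subject, predicate, object) tuples for one ingest activity."""
    triples = [
        (activity_iri, RDF_TYPE, PROV + "Activity"),
        (activity_iri, ACTIVITY_KIND, kind),  # kind is a plain literal
    ]
    # One prov:generated edge per successfully ingested document; failed
    # files contribute nothing because no entity was minted for them.
    triples += [(activity_iri, PROV + "generated", doc) for doc in doc_iris]
    return triples
```

For a single-file ingest this would be called with one Document IRI and the kind "trails.ingest.ingest_file"; a batch would pass every successful IRI at once.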
Chunk → vector store
After ingest, embed each Chunk's text and index the vector under the
chunk's IRI. The vector store carries no KG awareness — the hybrid
retriever relies on a single metadata["iri"] convention to bridge
back. See the vector guide for the store + embedder
surface.
from trails.ingest import Chunk, Document, ingest_file
from trails.vector import SqliteVecStore, SentenceTransformerEmbedder
doc_iri = ingest_file("paper.pdf", ctx)
embedder = SentenceTransformerEmbedder(model="all-MiniLM-L6-v2")
store = SqliteVecStore(path="vectors.db", dim=embedder.dim)
doc = Document.find(ctx, doc_iri)
# Index every chunk under its own IRI so hybrid retrieval can bridge back.
for chunk in Chunk.where(document=doc).order_by("index").fetch(ctx):
    store.add(
        id=chunk.id,
        vector=embedder.embed(chunk.text),
        metadata={"iri": chunk.id, "doc_iri": doc.id, "snippet": chunk.text[:280]},
    )
Two details earn their place. chunk.text_hash is the seam for
embedding dedup — a Phase 2 indexer can skip chunks whose hash already
has a vector. And metadata["iri"] MUST be the chunk IRI for hybrid
retrieval to work; metadata["snippet"] is picked up by
RetrievalHit.snippet automatically.
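A hash-aware indexing loop could look roughly like the sketch below. Everything here is hypothetical scaffolding: `index_new_chunks`, the `seen_hashes` set, and the `embed` / `add` callables stand in for whatever a Phase 2 indexer actually provides. Chunks are only assumed to expose `.id`, `.text`, and `.text_hash`.

```python
from typing import Callable, Iterable, Sequence


def index_new_chunks(
    chunks: Iterable,                        # objects with .id, .text, .text_hash
    seen_hashes: set[str],                   # hashes that already have a vector
    embed: Callable[[str], Sequence[float]],
    add: Callable[..., None],                # e.g. a store's add method
) -> int:
    """Embed and index only chunks whose text_hash is new; return the count."""
    added = 0
    for chunk in chunks:
        if chunk.text_hash in seen_hashes:
            continue  # identical text was embedded before; skip the model call
        add(
            id=chunk.id,
            vector=embed(chunk.text),
            metadata={"iri": chunk.id, "text_hash": chunk.text_hash},
        )
        seen_hashes.add(chunk.text_hash)
        added += 1
    return added
```

Note that skipping means a second chunk with identical text gets no vector of its own, so whether to skip or to reuse the earlier vector under the new IRI is a design choice for the indexer.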
Hybrid retrieval
Once chunks are indexed, pair a SPARQL narrow with the vector rerank:
from trails.vector import retrieve
hits = retrieve(
    "evidence the defendant was present",
    ctx=ctx,
    mode="hybrid",
    k=20,
    sparql_filter="""
        SELECT ?iri WHERE {
            ?iri <trails://app/Chunk/document> ?doc .
            ?doc <trails://app/Document/ingested_at> ?t .
            FILTER (?t >= "2024-01-01T00:00:00Z"^^xsd:dateTime)
        }
    """,
    vector_store=store,
    embedder=embedder,
)
retrieve runs the SPARQL to build a candidate IRI set, embeds the
query, over-fetches from the store (4 * k), and keeps only hits
whose metadata["iri"] is in the candidate set. Filter-then-rank is
correct in both regimes — see vector.md for the tradeoff.
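The over-fetch-then-filter step can be illustrated with an in-memory toy. This is not how `retrieve` is implemented: the brute-force cosine scan stands in for the vector store, `hybrid_filter` is a hypothetical name, and the candidate set is passed in directly instead of coming from SPARQL.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_filter(query_vec, index, candidate_iris, k):
    """index: list of (vector, metadata) pairs. Returns up to k (score, iri)."""
    scored = sorted(
        ((cosine(query_vec, vec), meta["iri"]) for vec, meta in index),
        reverse=True,
    )
    overfetched = scored[: 4 * k]  # over-fetch so filtering can still fill k
    kept = [hit for hit in overfetched if hit[1] in candidate_iris]
    return kept[:k]
```

The over-fetch factor matters because hits outside the candidate set are discarded after ranking; with a very selective filter, fewer than k results can survive.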
Anti-patterns
Skipping the chunker. Writing the full extracted text to a single
Document.text field (or a one-item chunk list) defeats the point: the
vector retriever can only score what the store indexes, and one 80 KB
chunk produces one embedding. Always chunk; the default paragraph
chunker is deterministic and free.
Re-ingesting without the hash check. Every ingest_file already
calls find_document_by_hash — use it. Bypassing the pipeline (e.g.,
calling load_document directly on a file you've already ingested)
doubles storage, duplicates prov:generated edges, and breaks
downstream dedup assumptions.
Reference
Every public name from trails.ingest:
| Symbol | Shape |
|---|---|
| ingest_file(path, ctx, *, source=None, mime=None, title=None, chunker=paragraph_chunker) | -> str (Document IRI) |
| ingest_directory(path, ctx, *, glob="*", recursive=True, chunker=paragraph_chunker, source_fn=None) | -> IngestReport |
| IngestReport | .documents, .failures, .activity_iri; len + iter |
| IngestFailure | .path, .error, .error_type |
| Document | @node_type: uri, title, mime, content_hash, byte_size, ingested_at |
| Chunk | @node_type: document, index, text, start_char, end_char, text_hash |
| load_document(ctx, text, *, uri, mime, chunks, title=None, ingested_at=None) | -> str (pre-chunked fast path) |
| content_hash_of(text) | -> str (16-char) |
| find_document_by_hash(ctx, content_hash) | -> Document \| None |
| file_uri(path) | -> str (file://…) |
| RawChunk | Dataclass: index, text, start_char, end_char |
| paragraph_chunker(text, *, min_chars=100, max_chars=1500) | -> list[Chunk] |
| extract(path_or_content, *, mime=None) | Dispatcher |
| extract_pdf, extract_html, extract_markdown, extract_docx, extract_text, extract_rtf | Per-format backends |
| detect_mime(path) | -> str |
| MissingExtractorDep | TrailsError subclass |
See also
- ADR-0019: app-surface design
- ADR-0009: PROV-O contract
- Vector Retrieval: embedders, stores, retrieve()
- ActiveGraph ORM: @node_type, Model.where, ref fields
- examples/ingestion-demo/: worked end-to-end example