
Document Ingestion

Turns on-disk documents into typed Document and Chunk triples in the knowledge graph (KG), with PROV-O provenance and content-hash deduplication.

Auto-generated docs

When trails is installed, run `ENABLE_MKDOCSTRINGS=true ./scripts/docs-build` to build the full docstring-extracted reference.

Pipeline

| Symbol | Signature | Description |
| --- | --- | --- |
| `ingest_file` | `ingest_file(path: str \| Path, ctx, *, source: str \| None = None, mime: str \| None = None, title: str \| None = None, chunker: ChunkerFn = paragraph_chunker, rml_mapping: str \| Path \| None = None) -> str` | Ingest a single file. Returns the minted Document IRI. Idempotent (same content hash = no-op). |
| `ingest_directory` | `ingest_directory(path: str \| Path, ctx, *, glob: str = "*", recursive: bool = True, chunker: ChunkerFn = paragraph_chunker, source_fn: Callable \| None = None) -> IngestReport` | Bulk ingest a directory. Best-effort; per-file failures are collected in `IngestReport.failures`. |

Extractors

| Symbol | Signature | Description |
| --- | --- | --- |
| `extract` | `extract(path: str \| Path, mime: str \| None = None) -> str` | Auto-detect MIME and extract text from a file |
| `extract_pdf` | `extract_pdf(path: str \| Path) -> str` | Extract text from PDF (requires pymupdf or pdfplumber) |
| `extract_html` | `extract_html(path: str \| Path) -> str` | Extract text from HTML (requires beautifulsoup4) |
| `extract_markdown` | `extract_markdown(path: str \| Path) -> str` | Extract text from Markdown |
| `extract_text` | `extract_text(path: str \| Path) -> str` | Read plain text |
| `detect_mime` | `detect_mime(path: str \| Path) -> str` | Detect MIME type from file extension |

Chunker

| Symbol | Signature | Description |
| --- | --- | --- |
| `paragraph_chunker` | `paragraph_chunker(text: str) -> list[Chunk]` | Split text into paragraph-level chunks |
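
Paragraph-level chunking is commonly done by splitting on blank lines. The sketch below shows that approach under stated assumptions: `ChunkSketch` is a hypothetical stand-in for the library's `Chunk` node type (whose fields are not documented here), and the blank-line rule is a guess at what "paragraph-level" means.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSketch:
    """Hypothetical stand-in for Chunk: position in the document plus the text."""
    index: int
    text: str


def split_paragraphs(text: str) -> list[ChunkSketch]:
    """Split on runs of blank lines, dropping empty pieces and surrounding whitespace."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [ChunkSketch(index=i, text=p) for i, p in enumerate(paragraphs)]


sample = "First paragraph.\n\nSecond one\nspans two lines.\n\n\n\nThird."
chunks = split_paragraphs(sample)
```

A function with the same shape as `paragraph_chunker` (text in, list of chunks out) is presumably what the `chunker` parameter of `ingest_file` accepts, though the exact `ChunkerFn` type is not shown on this page.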

Data types

| Symbol | Signature | Description |
| --- | --- | --- |
| `Document` | `@node_type` | Node type for ingested documents |
| `Chunk` | `@node_type` | Node type for document chunks |
| `IngestReport` | `IngestReport(documents: list[str], failures: list[IngestFailure])` | Batch ingest summary (iterable as list of Document IRIs) |
| `IngestFailure` | `IngestFailure(path: str, error: str, error_type: str)` | A single file that failed during batch ingest |
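
The report types above suggest a best-effort loop in which each file is attempted independently and errors are recorded rather than raised. Here is a minimal sketch of that pattern. `ReportSketch` and `FailureSketch` mirror the documented field names but are hypothetical stand-ins, and `ingest_all` and `fake_ingest` exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator


@dataclass
class FailureSketch:
    """Stand-in for IngestFailure, with the same three fields."""
    path: str
    error: str
    error_type: str


@dataclass
class ReportSketch:
    """Stand-in for IngestReport; iterating yields the Document IRIs."""
    documents: list[str] = field(default_factory=list)
    failures: list[FailureSketch] = field(default_factory=list)

    def __iter__(self) -> Iterator[str]:
        return iter(self.documents)


def ingest_all(paths: list[str], ingest_one: Callable[[str], str]) -> ReportSketch:
    """Try every path; record failures instead of aborting the batch."""
    report = ReportSketch()
    for path in paths:
        try:
            report.documents.append(ingest_one(path))
        except Exception as exc:  # best-effort: one bad file must not stop the rest
            report.failures.append(
                FailureSketch(path=path, error=str(exc), error_type=type(exc).__name__)
            )
    return report


def fake_ingest(path: str) -> str:
    """Hypothetical per-file ingest that rejects files ending in .bad."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"urn:example:document:{path}"


report = ingest_all(["a.txt", "b.bad", "c.md"], fake_ingest)
```

Callers can then loop over `report` for the successfully minted IRIs and inspect `report.failures` separately, for example to log or retry the failed paths.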