Document Ingestion
Turns on-disk documents into typed Document and Chunk triples in the knowledge graph (KG), with PROV-O provenance and content-hash deduplication.
Auto-generated docs
When trails is installed, run `ENABLE_MKDOCSTRINGS=true ./scripts/docs-build` for the full docstring-extracted reference.
Pipeline
| Symbol | Signature | Description |
|---|---|---|
| `ingest_file` | `ingest_file(path: str \| Path, ctx, *, source: str \| None = None, mime: str \| None = None, title: str \| None = None, chunker: ChunkerFn = paragraph_chunker, rml_mapping: str \| Path \| None = None) -> str` | Ingest a single file. Returns the minted Document IRI. Idempotent (same content hash = no-op) |
| `ingest_directory` | `ingest_directory(path: str \| Path, ctx, *, glob: str = "*", recursive: bool = True, chunker: ChunkerFn = paragraph_chunker, source_fn: Callable \| None = None) -> IngestReport` | Bulk ingest a directory. Best-effort; per-file failures collected in `IngestReport.failures` |
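
The "same content hash = no-op" behaviour can be pictured with a small stand-alone sketch. This is not the library's code: `TinyStore`, `ingest_bytes`, the choice of SHA-256, and the `urn:doc:` IRI scheme are all hypothetical, chosen only to illustrate how keying on a content hash makes repeated ingestion idempotent.

```python
import hashlib


class TinyStore:
    """Hypothetical in-memory stand-in for the KG, keyed by content hash."""

    def __init__(self) -> None:
        self.docs = {}   # content hash -> document IRI
        self.writes = 0  # how many times something was actually stored

    def ingest_bytes(self, data: bytes) -> str:
        # SHA-256 is an assumption; the source only says "content hash".
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.docs:
            # Same content hash: write nothing, return the existing IRI.
            return self.docs[digest]
        iri = f"urn:doc:{digest[:16]}"  # hypothetical IRI scheme
        self.docs[digest] = iri
        self.writes += 1
        return iri


store = TinyStore()
first = store.ingest_bytes(b"hello world")
second = store.ingest_bytes(b"hello world")   # duplicate content
other = store.ingest_bytes(b"something else")
```

In this sketch the second call returns the IRI minted by the first and leaves the store untouched, which is the property the `ingest_file` description calls idempotent.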

| Symbol | Signature | Description |
|---|---|---|
| `extract` | `extract(path: str \| Path, mime: str \| None = None) -> str` | Auto-detect MIME and extract text from a file |
| `extract_pdf` | `extract_pdf(path: str \| Path) -> str` | Extract text from PDF (requires pymupdf or pdfplumber) |
| `extract_html` | `extract_html(path: str \| Path) -> str` | Extract text from HTML (requires beautifulsoup4) |
| `extract_markdown` | `extract_markdown(path: str \| Path) -> str` | Extract text from Markdown |
| `extract_text` | `extract_text(path: str \| Path) -> str` | Read plain text |
| `detect_mime` | `detect_mime(path: str \| Path) -> str` | Detect MIME type from file extension |
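
As a rough illustration of extension-based detection followed by dispatch to a per-type extractor, the sketch below uses only the standard library. Everything in it is an assumption made for illustration: the `sketch_*` names, the extension table, the `application/octet-stream` fallback, and the choice to return Markdown as raw text. It does not reproduce how `detect_mime` or `extract` are implemented.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Hypothetical extension table; the real detect_mime may cover other types.
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".htm": "text/html",
    ".md": "text/markdown",
    ".txt": "text/plain",
}


def sketch_detect_mime(path: str | Path) -> str:
    # Look only at the (case-insensitive) file extension, never the content.
    return EXTENSION_TO_MIME.get(Path(path).suffix.lower(), "application/octet-stream")


def read_plain(path: str | Path) -> str:
    return Path(path).read_text(encoding="utf-8")


# Only text-like types are wired up here; PDF and HTML would need extra packages.
EXTRACTORS = {"text/plain": read_plain, "text/markdown": read_plain}


def sketch_extract(path: str | Path, mime: str | None = None) -> str:
    mime = mime or sketch_detect_mime(path)  # an explicit mime skips detection
    if mime not in EXTRACTORS:
        raise ValueError(f"no extractor registered for {mime}")
    return EXTRACTORS[mime](path)


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    note.write_text("# Title\n\nBody text.", encoding="utf-8")
    extracted = sketch_extract(note)
```

Passing `mime` explicitly bypasses detection in this sketch, which mirrors the optional `mime` parameter in the `extract` signature.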
Chunker
| Symbol | Signature | Description |
|---|---|---|
| `paragraph_chunker` | `paragraph_chunker(text: str) -> list[Chunk]` | Split text into paragraph-level chunks |
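
Since `ingest_file` and `ingest_directory` accept any `chunker`, it may help to see what a paragraph-level chunker could look like. The sketch below is an assumption-laden stand-in: `SketchChunk` is a hypothetical dataclass (not the library's `Chunk` node type), and splitting on blank lines is one plausible reading of "paragraph-level", not a statement of what `paragraph_chunker` does internally.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchChunk:
    """Hypothetical stand-in for the library's Chunk node type."""
    index: int
    text: str


def sketch_paragraph_chunker(text: str) -> list[SketchChunk]:
    # Treat one or more blank (or whitespace-only) lines as a paragraph break.
    pieces = [piece.strip() for piece in re.split(r"\n\s*\n", text)]
    # Drop empty pieces so leading or trailing blank lines yield no chunks.
    non_empty = [piece for piece in pieces if piece]
    return [SketchChunk(index, piece) for index, piece in enumerate(non_empty)]


sample = "First.\n\nSecond line one\nline two.\n\n\n  \nThird."
chunks = sketch_paragraph_chunker(sample)
```

A single newline inside a paragraph is kept, so the second chunk in `chunks` still spans two lines.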
Data types
| Symbol | Signature | Description |
|---|---|---|
| `Document` | `@node_type` | Node type for ingested documents |
| `Chunk` | `@node_type` | Node type for document chunks |
| `IngestReport` | `IngestReport(documents: list[str], failures: list[IngestFailure])` | Batch ingest summary (iterable as list of Document IRIs) |
| `IngestFailure` | `IngestFailure(path: str, error: str, error_type: str)` | A single file that failed during batch ingest |
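
The best-effort contract behind `IngestReport` and `IngestFailure` can be sketched with plain dataclasses. The field names follow the signatures in the table; everything else is hypothetical: the `Sketch*` classes, `best_effort`, `fake_ingest`, and the assumption that `error_type` holds the exception class name.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterator


@dataclass
class SketchFailure:
    path: str
    error: str
    error_type: str  # assumed here to be the exception class name


@dataclass
class SketchReport:
    documents: list[str] = field(default_factory=list)
    failures: list[SketchFailure] = field(default_factory=list)

    def __iter__(self) -> Iterator[str]:
        # Iterating the report yields the Document IRIs only.
        return iter(self.documents)


def best_effort(paths: list[str], ingest_one: Callable[[str], str]) -> SketchReport:
    report = SketchReport()
    for path in paths:
        try:
            report.documents.append(ingest_one(path))
        except Exception as exc:
            # One bad file is recorded and skipped; the batch keeps going.
            report.failures.append(SketchFailure(path, str(exc), type(exc).__name__))
    return report


def fake_ingest(path: str) -> str:
    """Hypothetical per-file ingest that rejects '.bad' files."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"urn:doc:{path}"


report = best_effort(["a.txt", "b.bad", "c.md"], fake_ingest)
```

Collecting failures instead of raising lets a caller process the successful IRIs first and then inspect or retry the entries in `failures`.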