ADR-0040: Multi-Modal KG Nodes — Images, Audio, and Documents as First-Class Graph Citizens
- Status: Accepted (2026-04-19)
- Date: 2026-04-18
- Supersedes: —
- Superseded by: —
Context
Trails ingests text documents (M10 Phase 1) and indexes them for hybrid retrieval (M10 Phase 2). Real-world knowledge graphs, however, carry more than text: medical imaging, audio recordings, architectural blueprints, scanned PDFs, product photographs. Today these artefacts live outside the KG — file systems, S3 buckets, specialised stores — with ad-hoc linking that the ORM, provenance, and retrieval layers cannot see.
The gap has concrete consequences:
- No provenance trail. A node referencing an image via a string URL has no prov:Activity linking the binary to its ingestion, no content hash for tamper detection, no MIME metadata for downstream tooling.
- No cross-modal retrieval. Text queries cannot surface relevant images; image queries cannot surface related text nodes. The vector store holds text embeddings only.
- No dedup. Identical files attached to different nodes are stored (and embedded) multiple times with no content-addressing.
Competing frameworks (LlamaIndex, LangChain, Haystack) treat binary assets as opaque blobs that pass through a pipeline but never become graph citizens. Trails can do better: binaries should be content-addressed, linked to nodes via standard predicates, and searchable through the same retrieval surface.
Decision
Binary attachments become first-class, opt-in citizens of the
knowledge graph. A new module python/src/trails/multimodal.py
provides:
1. Content-Addressed Attachment Store
Files are stored under data/attachments/<hash[:2]>/<hash> using the
full SHA-256 hex digest as the identity key. This mirrors the Git
object store pattern (two-char prefix directory for filesystem sanity).
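from trails.multimodal import AttachmentStore  # module introduced by this ADR

store = AttachmentStore(base_dir="data/attachments")
att = store.store(data=raw_bytes, mime_type="image/png")  # raw_bytes: the file contents as bytes
assert store.exists(att.content_hash)
assert store.retrieve(att.content_hash) == raw_bytes  # round-trips byte-for-byte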
Identical bytes (same SHA-256) are stored once regardless of how many
nodes reference them. The store is a file tree — no database, no
daemon. Backup = rsync.
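The layout and dedup behaviour described above can be sketched with the standard library alone. This is an illustrative stand-in, not the Trails implementation: the names `SketchAttachmentStore`, `StoredBlob`, and `demo` are hypothetical, and the real `AttachmentStore` may differ in return type and error handling.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class StoredBlob:
    """Hypothetical return value; the real attachment object may carry more fields."""
    content_hash: str
    mime_type: str
    size_bytes: int


class SketchAttachmentStore:
    """Illustrative content-addressed file tree (assumption, not the Trails class)."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)

    def _path_for(self, content_hash: str) -> Path:
        # Two-character prefix directory, then the full hex digest as the file name.
        return self.base_dir / content_hash[:2] / content_hash

    def store(self, data: bytes, mime_type: str) -> StoredBlob:
        content_hash = hashlib.sha256(data).hexdigest()
        path = self._path_for(content_hash)
        if not path.exists():
            # Only the first writer of a given digest touches the disk.
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
        return StoredBlob(content_hash, mime_type, len(data))

    def exists(self, content_hash: str) -> bool:
        return self._path_for(content_hash).is_file()

    def retrieve(self, content_hash: str) -> bytes:
        return self._path_for(content_hash).read_bytes()


def demo():
    """Store the same bytes twice and report what ended up on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        store = SketchAttachmentStore(tmp)
        first = store.store(b"same bytes", "image/png")
        second = store.store(b"same bytes", "image/png")
        files = [p for p in Path(tmp).rglob("*") if p.is_file()]
        same_hash = first.content_hash == second.content_hash
        return same_hash, len(files), store.retrieve(first.content_hash)
```

Calling `demo()` stores identical bytes twice and leaves a single file in the tree, which is the property the dedup claim relies on.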
2. Graph Linking via trails:hasAttachment
The attach() function creates a BinaryAttachment node in the KG
and links it to the target node via the trails:hasAttachment
predicate:
<node_iri> <https://trails.dev/ns/hasAttachment> <attachment_iri> .
<attachment_iri> a <https://trails.dev/ns/BinaryAttachment> .
<attachment_iri> <https://trails.dev/ns/contentHash> "sha256:abc..." .
<attachment_iri> <https://trails.dev/ns/mimeType> "image/png" .
<attachment_iri> <https://trails.dev/ns/sizeBytes> "12345" .
<attachment_iri> <https://trails.dev/ns/storagePath> "ab/abc123..." .
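As a rough sketch of how these six statements could be produced, the helper below builds them as N-Triples strings. The function name `attachment_triples` and the example IRIs are hypothetical, and how `attach()` actually mints attachment IRIs is not specified here, so the IRI is taken as a parameter. N-Triples has no `a` shorthand, so the full `rdf:type` IRI is spelled out.

```python
TRAILS_NS = "https://trails.dev/ns/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def _literal(value) -> str:
    """Render a plain string literal with backslashes and quotes escaped."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def attachment_triples(node_iri, attachment_iri, content_hash, mime_type, size_bytes):
    """Return the link and metadata statements for one attachment (illustrative)."""
    # Mirrors the on-disk layout: two-character prefix directory, then the digest.
    storage_path = f"{content_hash[:2]}/{content_hash}"
    return [
        f"<{node_iri}> <{TRAILS_NS}hasAttachment> <{attachment_iri}> .",
        f"<{attachment_iri}> <{RDF_TYPE}> <{TRAILS_NS}BinaryAttachment> .",
        f"<{attachment_iri}> <{TRAILS_NS}contentHash> {_literal('sha256:' + content_hash)} .",
        f"<{attachment_iri}> <{TRAILS_NS}mimeType> {_literal(mime_type)} .",
        f"<{attachment_iri}> <{TRAILS_NS}sizeBytes> {_literal(size_bytes)} .",
        f"<{attachment_iri}> <{TRAILS_NS}storagePath> {_literal(storage_path)} .",
    ]
```

Because every metadata statement hangs off the attachment IRI rather than the target node, several nodes can point at the same attachment without duplicating its metadata.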
3. binary_field() Descriptor for @node_type
Opt-in at the model level — standard @node_type classes work
unchanged. Authors who need binary attachments declare them explicitly:
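from datetime import datetime

from trails.multimodal import binary_field  # provided by the new module

# node_type is the existing Trails ORM decorator (import omitted here).
@node_type("MedicalImage", fields={
    "patient_id": str,
    "scan_date": datetime,
    # The registered MIME type for DICOM is application/dicom.
    "image": binary_field(mime_types=["image/png", "application/dicom"]),
})
class MedicalImage:
    pass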
binary_field() returns a field descriptor that the ORM recognises as
a binary slot. On ctx.kg.add(instance) the ORM delegates binary data
to the AttachmentStore and inserts the graph link. On read, the
field hydrates as a BinaryAttachment metadata object (not the raw
bytes — pull those explicitly via get_attachment_data()).
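One way an ORM could recognise binary slots is to look for a marker object in the fields mapping and route those entries separately. The sketch below assumes that approach; `BinaryFieldSpec`, `binary_field_sketch`, and `split_fields` are hypothetical names, and the real descriptor may work differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BinaryFieldSpec:
    """Hypothetical marker an ORM could detect in a fields mapping."""
    mime_types: Optional[Tuple[str, ...]] = None  # None means any MIME type is accepted

    def accepts(self, mime_type: str) -> bool:
        return self.mime_types is None or mime_type in self.mime_types


def binary_field_sketch(mime_types: Optional[Sequence[str]] = None) -> BinaryFieldSpec:
    """Stand-in for binary_field(); freezes the allowed MIME types into a tuple."""
    return BinaryFieldSpec(tuple(mime_types) if mime_types is not None else None)


def split_fields(fields: Dict[str, object]):
    """Partition a fields mapping into ordinary slots and binary slots."""
    scalar = {k: v for k, v in fields.items() if not isinstance(v, BinaryFieldSpec)}
    binary = {k: v for k, v in fields.items() if isinstance(v, BinaryFieldSpec)}
    return scalar, binary
```

With a split like this, models that declare no binary slots take exactly the same path as before, which matches the opt-in intent.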
4. Provenance
Every attach() call emits a prov:Activity (kind
trails.multimodal.attach) linking the attachment IRI as a
prov:Entity. This lets the provenance explorer trace when and how
a binary was associated with a node — same fidelity as text ingest.
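A minimal sketch of what such a provenance record might look like as N-Triples, using standard PROV-O terms (`prov:Activity`, `prov:Entity`, `prov:wasGeneratedBy`, `prov:endedAtTime`). How Trails encodes the activity kind is not stated here, so the `activityKind` predicate, the `urn:uuid:` activity IRI, and the function name `attach_provenance` are assumptions made for illustration.

```python
import uuid
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
# Hypothetical predicate: the encoding of the activity kind is an assumption.
KIND_PREDICATE = "https://trails.dev/ns/activityKind"


def attach_provenance(attachment_iri, activity_iri=None, when=None):
    """Return statements tying an attachment entity to the activity that created it."""
    activity_iri = activity_iri or f"urn:uuid:{uuid.uuid4()}"
    when = when or datetime.now(timezone.utc)
    stamp = when.isoformat()
    return [
        f"<{activity_iri}> <{RDF_TYPE}> <{PROV}Activity> .",
        f'<{activity_iri}> <{KIND_PREDICATE}> "trails.multimodal.attach" .',
        f'<{activity_iri}> <{PROV}endedAtTime> "{stamp}"^^<{XSD_DATETIME}> .',
        f"<{attachment_iri}> <{RDF_TYPE}> <{PROV}Entity> .",
        f"<{attachment_iri}> <{PROV}wasGeneratedBy> <{activity_iri}> .",
    ]
```

Given these statements, a provenance query can start from any attachment IRI and walk `prov:wasGeneratedBy` to find when it was attached.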
5. Cross-Modal Retrieval (Future-Ready)
The module exposes multimodal_retrieve(ctx, query, mode="text",
top_k=10) which queries the existing vector store. When multi-modal
embedding providers are wired (CLIP, ImageBind), the same function
will search across modalities in a unified vector space. Phase 1
supports text-only retrieval; the interface is stable.
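The idea of a stable signature with a Phase 1 text-only restriction can be sketched as a thin dispatcher. Everything here is hypothetical: `multimodal_retrieve_sketch` takes a plain search callable in place of `ctx`, and raising `NotImplementedError` for other modes is an assumed behaviour, not something this ADR specifies.

```python
from typing import Callable, List, Tuple

# Modes the sketch accepts today; further modalities would be added here later.
SUPPORTED_MODES = frozenset({"text"})

SearchFn = Callable[[str, int], List[Tuple[str, float]]]


def multimodal_retrieve_sketch(search_fn: SearchFn, query: str,
                               mode: str = "text", top_k: int = 10):
    """Route a query to the vector search, rejecting modes not yet wired up."""
    if mode not in SUPPORTED_MODES:
        raise NotImplementedError(f"mode {mode!r} is not supported in this sketch")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Slice defensively in case the backend returns more hits than requested.
    return search_fn(query, top_k)[:top_k]
```

Keeping `mode` in the signature from the start means callers would not need to change when additional modalities are accepted.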
Scope Fence
- No blob storage in Oxigraph. Binary data is NEVER stored inside the triple store. The graph holds metadata and IRIs; files live on the filesystem.
- No streaming. store() and retrieve() operate on bytes. Large-file streaming is out of scope for Phase 1.
- No multi-modal embedding providers in this ADR. The embedding provider protocol (EmbeddingProvider) is unchanged; CLIP/ImageBind adapters are a follow-on.
- No automatic thumbnail generation or format conversion.
- No cloud storage backends (S3, GCS). The AttachmentStore operates on a local directory. A cloud adapter can subclass it later.
Consequences
- @node_type gains an optional binary_field() descriptor, with zero impact on existing models that don't use it.
- AttachmentStore is a new dependency-free component (stdlib only).
- The trails:hasAttachment predicate is framework-reserved; user ontologies must not collide.
- Content-addressed dedup means storage grows with unique content, not with the number of references.
- The provenance chain now covers binary assets, closing the audit gap for regulated domains (medical, legal, financial).
Alternatives Considered
- Store binaries as base64 literals in Oxigraph. Rejected: bloats the triple store, breaks SPARQL performance, violates the "graph for metadata, files for content" principle.
- External object store only (S3-style). Rejected for Phase 1: adds operational complexity. The local file store is sufficient and a cloud adapter can wrap it later.
- Reuse the ingest pipeline for binaries. Partially applicable — the ingest pipeline handles text extraction. Binary attachments are a different concern (no extraction, just storage + linking). The two systems compose: ingest a PDF for its text, attach the original PDF binary to the same Document node.