
ADR-0040: Multi-Modal KG Nodes — Images, Audio, and Documents as First-Class Graph Citizens

  • Status: Accepted (2026-04-19)
  • Date: 2026-04-18
  • Supersedes:
  • Superseded by:

Context

Trails ingests text documents (M10 Phase 1) and indexes them for hybrid retrieval (M10 Phase 2). Real-world knowledge graphs, however, carry more than text: medical imaging, audio recordings, architectural blueprints, scanned PDFs, product photographs. Today these artefacts live outside the KG — file systems, S3 buckets, specialised stores — with ad-hoc linking that the ORM, provenance, and retrieval layers cannot see.

The gap has concrete consequences:

  1. No provenance trail. A node referencing an image via a string URL has no prov:Activity linking the binary to its ingestion, no content hash for tamper detection, no MIME metadata for downstream tooling.
  2. No cross-modal retrieval. Text queries cannot surface relevant images; image queries cannot surface related text nodes. The vector store holds text embeddings only.
  3. No dedup. Identical files attached to different nodes are stored (and embedded) multiple times with no content-addressing.

Competing frameworks (LlamaIndex, LangChain, Haystack) treat binary assets as opaque blobs that pass through a pipeline but never become graph citizens. Trails can do better: binaries should be content-addressed, linked to nodes via standard predicates, and searchable through the same retrieval surface.

Decision

Binary attachments become first-class, opt-in citizens of the knowledge graph. A new module python/src/trails/multimodal.py provides:

1. Content-Addressed Attachment Store

Files are stored under data/attachments/<hash[:2]>/<hash> using the full SHA-256 hex digest as the identity key. This mirrors the Git object store pattern (two-char prefix directory for filesystem sanity).

store = AttachmentStore(base_dir="data/attachments")
att = store.store(data=raw_bytes, mime_type="image/png")
assert store.exists(att.content_hash)
assert store.retrieve(att.content_hash) == raw_bytes

Identical bytes (same SHA-256) are stored once regardless of how many nodes reference them. The store is a file tree — no database, no daemon. Backup = rsync.
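
The layout and dedup behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the Trails implementation: the class name SketchAttachmentStore and the helper demo_dedup are hypothetical, and the real store() also takes a mime_type and returns an attachment object rather than a bare digest.

```python
import hashlib
import tempfile
from pathlib import Path


class SketchAttachmentStore:
    """Illustrative content-addressed file store (hypothetical, not Trails code)."""

    def __init__(self, base_dir: str) -> None:
        self.base_dir = Path(base_dir)

    def _path_for(self, content_hash: str) -> Path:
        # Two-character prefix directory, then the full digest as the file name.
        return self.base_dir / content_hash[:2] / content_hash

    def store(self, data: bytes) -> str:
        content_hash = hashlib.sha256(data).hexdigest()
        path = self._path_for(content_hash)
        if not path.exists():  # identical bytes are written only once
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
        return content_hash

    def exists(self, content_hash: str) -> bool:
        return self._path_for(content_hash).is_file()

    def retrieve(self, content_hash: str) -> bytes:
        return self._path_for(content_hash).read_bytes()


def demo_dedup() -> tuple:
    """Store the same bytes twice and count the files left on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        store = SketchAttachmentStore(tmp)
        first = store.store(b"same bytes")
        second = store.store(b"same bytes")
        file_count = sum(1 for p in Path(tmp).rglob("*") if p.is_file())
        return first, second, file_count
```

Because the digest is both the identity key and the file name, a second store() of the same bytes resolves to the same path and writes nothing new.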

2. Graph Linking via trails:hasAttachment

The attach() function creates a BinaryAttachment node in the KG and links it to the target node via the trails:hasAttachment predicate:

<node_iri> <https://trails.dev/ns/hasAttachment> <attachment_iri> .
<attachment_iri> a <https://trails.dev/ns/BinaryAttachment> .
<attachment_iri> <https://trails.dev/ns/contentHash> "sha256:abc..." .
<attachment_iri> <https://trails.dev/ns/mimeType> "image/png" .
<attachment_iri> <https://trails.dev/ns/sizeBytes> "12345" .
<attachment_iri> <https://trails.dev/ns/storagePath> "ab/abc123..." .

3. binary_field() Descriptor for @node_type

Opt-in at the model level — standard @node_type classes work unchanged. Authors who need binary attachments declare them explicitly:

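from datetime import datetime

@node_type("MedicalImage", fields={
    "patient_id": str,
    "scan_date": datetime,
    # application/dicom is the IANA-registered media type for DICOM files
    "image": binary_field(mime_types=["image/png", "application/dicom"]),
})
class MedicalImage:
    pass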

binary_field() returns a field descriptor that the ORM recognises as a binary slot. On ctx.kg.add(instance) the ORM delegates binary data to the AttachmentStore and inserts the graph link. On read, the field hydrates as a BinaryAttachment metadata object (not the raw bytes — pull those explicitly via get_attachment_data()).
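
One way to picture how an ORM might recognise a binary slot is a small marker object carrying the MIME allow-list, which the registration step separates from ordinary fields. The sketch below is an assumption-laden illustration: BinaryFieldSketch, binary_field_sketch, and split_fields are hypothetical names, and treating an empty allow-list as "accept anything" is a guess rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryFieldSketch:
    """Marks a model field as a binary slot with an optional MIME allow-list."""
    mime_types: tuple = ()

    def accepts(self, mime_type: str) -> bool:
        # Assumption: an empty allow-list accepts any MIME type.
        return not self.mime_types or mime_type in self.mime_types


def binary_field_sketch(mime_types=None) -> BinaryFieldSketch:
    """Hypothetical stand-in for binary_field()."""
    return BinaryFieldSketch(tuple(mime_types or ()))


def split_fields(fields: dict) -> tuple:
    """Separate ordinary fields from binary slots, as an ORM might at registration."""
    plain = {k: v for k, v in fields.items() if not isinstance(v, BinaryFieldSketch)}
    binary = {k: v for k, v in fields.items() if isinstance(v, BinaryFieldSketch)}
    return plain, binary
```

With the slots separated this way, an add operation could route each binary value to the attachment store and write only the ordinary fields as regular triples.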

4. Provenance

Every attach() call emits a prov:Activity (kind trails.multimodal.attach) linking the attachment IRI as a prov:Entity. This lets the provenance explorer trace when and how a binary was associated with a node — same fidelity as text ingest.
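
The shape of such a provenance record might look like the sketch below, which renders it as N-Triples using W3C PROV-O terms. Several details are assumptions: the choice of prov:used to connect the activity to the attachment, the prov:endedAtTime timestamp, and the trails activityKind predicate are illustrative and are not confirmed by this ADR; attach_provenance is a hypothetical helper.

```python
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
TRAILS_NS = "https://trails.dev/ns/"


def attach_provenance(activity_iri: str, attachment_iri: str, when: datetime) -> list:
    """Describe one attach activity as N-Triples lines (when must be timezone-aware)."""
    stamp = when.astimezone(timezone.utc).isoformat()
    act = f"<{activity_iri}>"
    att = f"<{attachment_iri}>"
    return [
        f"{act} <{RDF_TYPE}> <{PROV}Activity> .",
        # Hypothetical predicate for the activity kind named in the text.
        f'{act} <{TRAILS_NS}activityKind> "trails.multimodal.attach" .',
        f'{act} <{PROV}endedAtTime> "{stamp}"^^<{XSD_DATETIME}> .',
        f"{att} <{RDF_TYPE}> <{PROV}Entity> .",
        # Assumed link: the activity used the attachment entity.
        f"{act} <{PROV}used> {att} .",
    ]
```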

5. Cross-Modal Retrieval (Future-Ready)

The module exposes multimodal_retrieve(ctx, query, mode="text", top_k=10), which queries the existing vector store. When multi-modal embedding providers (CLIP, ImageBind) are wired in, the same function will search across modalities in a unified vector space. Phase 1 supports text-only retrieval; the interface is stable.
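
To make the "stable interface, text-only for now" idea concrete, here is a toy sketch over an in-memory index rather than the real vector store. Everything in it is assumed for illustration: retrieve_sketch is not multimodal_retrieve, the index is a plain dict of node IRI to embedding, cosine similarity is a stand-in ranking, and raising NotImplementedError for non-text modes is a guess at Phase 1 behaviour.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_sketch(index: dict, query_vector, mode: str = "text", top_k: int = 10) -> list:
    """Rank indexed nodes against a query embedding; only mode='text' is handled."""
    if mode != "text":
        # Assumption: other modalities are rejected until providers exist.
        raise NotImplementedError(f"mode {mode!r} is not supported in this sketch")
    scored = [(cosine(query_vector, vec), iri) for iri, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(iri, score) for score, iri in scored[:top_k]]
```

Keeping mode in the signature from the start means callers would not need to change when image or audio queries become available; only the branch that rejects them would.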

Scope Fence

  • No blob storage in Oxigraph. Binary data is NEVER stored inside the triple store. The graph holds metadata and IRIs; files live on the filesystem.
  • No streaming. store() and retrieve() operate on bytes. Large-file streaming is out of scope for Phase 1.
  • No multi-modal embedding providers in this ADR. The embedding provider protocol (EmbeddingProvider) is unchanged; CLIP/ImageBind adapters are a follow-on.
  • No automatic thumbnail generation or format conversion.
  • No cloud storage backends (S3, GCS). The AttachmentStore operates on a local directory. A cloud adapter can subclass it later.

Consequences

  • @node_type gains an optional binary_field() descriptor — zero impact on existing models that don't use it.
  • AttachmentStore is a new dependency-free component (stdlib only).
  • The trails:hasAttachment predicate is framework-reserved; user ontologies must not collide.
  • Content-addressed dedup means storage grows with unique content, not with the number of references.
  • The provenance chain now covers binary assets, closing the audit gap for regulated domains (medical, legal, financial).

Alternatives Considered

  1. Store binaries as base64 literals in Oxigraph. Rejected: it bloats the triple store, degrades SPARQL query performance, and violates the "graph for metadata, files for content" principle.
  2. External object store only (S3-style). Rejected for Phase 1: adds operational complexity. The local file store is sufficient and a cloud adapter can wrap it later.
  3. Reuse the ingest pipeline for binaries. Partially applicable — the ingest pipeline handles text extraction. Binary attachments are a different concern (no extraction, just storage + linking). The two systems compose: ingest a PDF for its text, attach the original PDF binary to the same Document node.