
06 — Test Plan

Test pyramid (adjusted for agentic-semantic)

                    / \
                   / E \                 ← Agent-sim (nondeterministic, golden-shape)
                  /─────\
                 /  C    \               ← Conformance (cross-backend, cross-runtime)
                /─────────\
               /     I     \             ← Integration (process-level, real backends)
              /─────────────\
             /   Properties   \          ← Property (QuickCheck-style invariants)
            /─────────────────\
           /       Unit         \        ← Unit (fast, pure, per-module)
          /─────────────────────\

Five bands, each with distinct failure modes and tools.


Band 1 — Unit tests

Scope: per-function / per-trait-method. Pure where possible.

1.1 Rust kernel unit tests

| Crate | Coverage target | Example tests |
|---|---|---|
| trails-graph | ≥ 90% | test_named_graph_isolation, test_snapshot_rollback |
| trails-shapes | ≥ 95% | test_cardinality_violation, test_nested_shape |
| trails-reason | ≥ 80% | test_rdfs_subclass_entailment, test_inferred_cache_invalidation |
| trails-policy | ≥ 95% | test_permit_on_valid_vc, test_deny_with_diagnostics |
| trails-prov | ≥ 90% | test_prov_activity_chain, test_derivation_link, test_ect_emission_L1, test_ect_emission_L2_signed, test_ect_emission_L3_anchored, test_prov_ect_roundtrip |
| trails-caps | ≥ 90% | test_mcp_projection, test_openapi_projection, test_jsonld_canonical |
| trails-identity | ≥ 90% | test_did_key_resolve, test_did_web_resolve_tls_pinned, test_act_verify, test_act_replay_rejected, test_act_attenuation_to_biscuit, test_alg_none_rejected |
| trails-cost | ≥ 95% | test_budget_exceeded, test_envelope_close_actual |

Tooling: cargo test, cargo tarpaulin for coverage, proptest for property tests (see Band 2).

1.2 Python surface unit tests

| Module | Coverage target | Example tests |
|---|---|---|
| decorators.py | ≥ 90% | test_capability_registration, test_shape_field_mapping |
| mcp_server.py | ≥ 80% | test_tools_list, test_tools_call_dispatch |
| http_adapter.py | ≥ 80% | test_content_negotiation, test_openapi_generation |
| render.py | ≥ 90% | test_bimodal_markdown, test_bimodal_jsonld |
| cli/* | ≥ 70% | test_new_scaffold, test_onto_export, test_trails_g_resource, test_trails_dev_autoreload |
| testing.py | ≥ 80% | test_assert_shape, test_assert_provenance_chain, test_fake_act |

Tooling: pytest, pytest-cov, hypothesis for property tests.

1.3 FFI boundary tests

Guarantees across the PyO3 edge:

  • Panic containment — injected panic! at each kernel entry point (trails_core.graph, trails_core.shapes, trails_core.policy, trails_core.prov, trails_core.caps, trails_core.cost, trails_core.identity) surfaces as a Python KernelError; the Python process does not abort (NFR-Sec14).
  • Structured error preservation — every Rust TrailsError variant maps to the documented Python exception type (ValidationError, AuthenticationError, AuthorizationError, PreconditionError, BudgetExceededError, HandlerError, BackendError, KernelError) with .field, .constraint, .iri context fields populated.
  • ABI3 wheel — the same binary imports under Python 3.11, 3.12, 3.13 (covered under Band 4 ABI3 conformance).
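
The structured-error guarantee lends itself to a table-driven check. The sketch below is only an illustration under stated assumptions: the Rust variant names (derived here by dropping the "Error" suffix), the common base class, and the `to_python_error` helper are hypothetical stand-ins for whatever the PyO3 layer actually exposes. Only the Python exception names and the `.field`, `.constraint`, `.iri` attributes come from the list above.

```python
class TrailsPyError(Exception):
    """Hypothetical common base carrying the documented context fields."""
    def __init__(self, message, field=None, constraint=None, iri=None):
        super().__init__(message)
        self.field, self.constraint, self.iri = field, constraint, iri

# Exception names are taken from the plan; they are defined locally so the
# sketch runs without the real package.
EXCEPTION_NAMES = [
    "ValidationError", "AuthenticationError", "AuthorizationError",
    "PreconditionError", "BudgetExceededError", "HandlerError",
    "BackendError", "KernelError",
]
EXCEPTIONS = {name: type(name, (TrailsPyError,), {}) for name in EXCEPTION_NAMES}

# Assumed one-to-one naming, e.g. Rust `Validation` -> Python `ValidationError`.
VARIANT_TO_EXCEPTION = {
    name.removesuffix("Error"): EXCEPTIONS[name] for name in EXCEPTION_NAMES
}

def to_python_error(variant, message, **context):
    """Stand-in for the FFI conversion; unknown variants fall back to KernelError."""
    exc_type = VARIANT_TO_EXCEPTION.get(variant, EXCEPTIONS["KernelError"])
    return exc_type(message, **context)

def check_mapping_is_total():
    """Every variant maps to its exception type with all context fields populated."""
    for variant, exc_type in VARIANT_TO_EXCEPTION.items():
        err = to_python_error(variant, "boom", field="f", constraint="c", iri="urn:x")
        assert type(err) is exc_type
        assert (err.field, err.constraint, err.iri) == ("f", "c", "urn:x")
    return len(VARIANT_TO_EXCEPTION)
```

A real test would iterate over the variants exported by the kernel rather than a hand-written list, so that adding a Rust variant without a Python mapping fails the suite.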

Band 2 — Property tests

Scope: invariants over random inputs. Catches edge cases unit tests miss.

2.1 Graph invariants

  • Idempotent load: load(g, d) then load(g, d) yields identical graph state.
  • Transaction atomicity: on rollback, no triple from the aborted transaction is visible.
  • Named-graph isolation: triples in graph A never appear in queries restricted to graph B.
  • SPARQL round-trip: CONSTRUCT of a graph, re-INSERTed into an empty store, equals the original (modulo blank node renaming).
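
As a rough illustration of how the first and third invariants might be phrased, the sketch below checks idempotent load and named-graph isolation against a toy in-memory store. `ToyStore` is a hypothetical stand-in for a GraphStore, and a seeded `random` loop replaces the hypothesis or proptest generators the plan names, purely to keep the example dependency-free.

```python
import random

class ToyStore:
    """In-memory stand-in for a GraphStore: a set of (graph, s, p, o) quads."""
    def __init__(self):
        self.quads = set()

    def load(self, graph, triples):
        # Set semantics make a repeated load of the same data a no-op.
        self.quads |= {(graph, *t) for t in triples}

    def query(self, graph):
        return {(s, p, o) for g, s, p, o in self.quads if g == graph}

def random_triples(rng, n=20):
    return [
        (f"s{rng.randrange(5)}", f"p{rng.randrange(3)}", f"o{rng.randrange(5)}")
        for _ in range(n)
    ]

def check_graph_invariants(trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = random_triples(rng), random_triples(rng)
        store = ToyStore()
        store.load("A", a)
        once = set(store.quads)
        store.load("A", a)                 # idempotent load
        assert store.quads == once
        store.load("B", b)                 # named-graph isolation
        assert store.query("A") == set(a)
        assert store.query("B") == set(b)
    return trials
```

With hypothesis, the loop body would become a `@given`-decorated test and failing inputs would be shrunk automatically.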

2.2 Shape invariants

  • Monotonicity: adding a more-permissive shape cannot cause previously-valid data to fail.
  • Composition: validate(data, [A, B]) ≡ validate(data, A) ∧ validate(data, B).
  • Cardinality: any data satisfying min=n, max=m triggers no cardinality violation; data with n-1 or m+1 occurrences triggers violation with correct diagnostic.
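
The cardinality invariant is a boundary check, which the following sketch spells out. It assumes a toy validator; `cardinality_violations` and its diagnostic dictionary are hypothetical and not the trails-shapes API, although `minCount` and `maxCount` mirror the SHACL constraint names.

```python
def cardinality_violations(values, min_count, max_count, path="ex:name"):
    """Toy cardinality check returning a list of diagnostics (empty means valid)."""
    n = len(values)
    if n < min_count:
        return [{"path": path, "constraint": "minCount",
                 "expected": min_count, "actual": n}]
    if n > max_count:
        return [{"path": path, "constraint": "maxCount",
                 "expected": max_count, "actual": n}]
    return []

def check_cardinality_boundaries(max_bound=6):
    """Exhaustively test every (min, max) pair below max_bound at its boundaries."""
    checked = 0
    for lo in range(1, max_bound):
        for hi in range(lo, max_bound):
            # Every count inside [lo, hi] is valid.
            for n in range(lo, hi + 1):
                assert cardinality_violations(["v"] * n, lo, hi) == []
            # One below and one above must each report the matching constraint.
            below = cardinality_violations(["v"] * (lo - 1), lo, hi)
            above = cardinality_violations(["v"] * (hi + 1), lo, hi)
            assert below[0]["constraint"] == "minCount"
            assert above[0]["constraint"] == "maxCount"
            checked += 1
    return checked
```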

2.3 Policy invariants

  • Determinism: same request → same decision, always.
  • Explicit-deny wins: presence of an explicit forbid always overrides any permit.
  • Context-freeness: decisions depend only on declared inputs (principal, action, resource, context) — no hidden state.
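
To make the first two invariants concrete, here is a minimal sketch over a toy evaluator. `decide` and the `POLICIES` list are hypothetical; they imitate forbid-overrides-permit with default deny and do not call a real policy engine.

```python
from itertools import product

def decide(policies, request):
    """Toy evaluator: any matching forbid denies; otherwise any matching permit allows."""
    matching = [effect for effect, predicate in policies if predicate(request)]
    if "forbid" in matching:
        return "deny"
    return "allow" if "permit" in matching else "deny"

POLICIES = [
    ("permit", lambda r: r["action"] == "read"),
    ("permit", lambda r: r["principal"] == "admin"),
    ("forbid", lambda r: r["resource"] == "secret"),
]

def check_policy_invariants():
    requests = [
        {"principal": p, "action": a, "resource": res}
        for p, a, res in product(
            ["admin", "guest"], ["read", "write"], ["public", "secret"]
        )
    ]
    for request in requests:
        first = decide(POLICIES, request)
        # Determinism: an equal request yields an equal decision.
        assert decide(POLICIES, dict(request)) == first
        # Explicit forbid wins over any permit.
        if request["resource"] == "secret":
            assert first == "deny"
    return sum(decide(POLICIES, r) == "allow" for r in requests)
```

Because the request space here is tiny, the sketch enumerates it exhaustively; a property test would sample from a much larger space of principals, actions, and resources.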

2.4 Provenance invariants

  • Conservation: every generated entity is traceable back through used entities to original inputs (or declared external sources).
  • Monotonicity: PROV triples are append-only; no existing activity record is ever mutated.
  • Signed integrity (when enabled): tampered PROV records fail signature verification.
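
The append-only and signed-integrity properties can be sketched with standard-library primitives. Note the substitution: this example uses an HMAC over canonical JSON as a simple stand-in for whatever signature scheme the framework uses, and `ProvLog`, `sign`, and `verify` are hypothetical names.

```python
import hashlib
import hmac
import json

KEY = b"test-only-key"  # assumption: a per-run test key, never a real secret

def sign(record, key=KEY):
    # Sorted keys give a stable byte representation to sign.
    payload = json.dumps(record, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify(record, signature, key=KEY):
    return hmac.compare_digest(sign(record, key), signature)

class ProvLog:
    """Append-only list of (record, signature) pairs; no update or delete exists."""
    def __init__(self):
        self._entries = []

    def append(self, record):
        self._entries.append((dict(record), sign(record)))

    def entries(self):
        # A tuple prevents callers from reordering or dropping entries in place.
        return tuple(self._entries)

log = ProvLog()
log.append({"activity": "patient.intake",
            "used": ["urn:input:1"], "generated": "urn:out:1"})
record, signature = log.entries()[0]
tampered = {**record, "generated": "urn:out:evil"}
```

Verifying `record` against `signature` succeeds, while verifying `tampered` against the same signature fails, which is the behaviour the third invariant asks for.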

2.5 Capability invariants

  • Manifest consistency: MCP projection, JSON-LD canonical, and OpenAPI projection all describe the same input/output shape.
  • Version coherence: capability.version is SemVer; deprecates always references a valid prior version.
  • Idempotency: capabilities declared idempotent=True produce identical observable state on repeated invocation with the same idempotency key.
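
The idempotency invariant might be exercised along these lines. `IdempotentCapability` is a hypothetical toy wrapper, and its `state` list merely stands in for observable graph state.

```python
class IdempotentCapability:
    """Toy wrapper: replays the stored result when an idempotency key repeats."""
    def __init__(self, handler):
        self.handler = handler
        self.results = {}
        self.state = []      # stands in for observable graph state

    def invoke(self, key, payload):
        if key not in self.results:
            # First use of this key: perform the side effect exactly once.
            self.state.append(payload)
            self.results[key] = self.handler(payload)
        return self.results[key]

cap = IdempotentCapability(lambda p: {"stored": p})
first = cap.invoke("key-1", "alice")
state_after_first = list(cap.state)
second = cap.invoke("key-1", "alice")   # same key: no new side effect
```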

Band 3 — Integration tests

Scope: multiple subsystems, real backends (no mocks). Process-level.

3.1 End-to-end capability invocation

  • Python @capability → PyO3 → Rust kernel → Oxigraph → response envelope
  • Validation failure at input → correct 400 + ValidationReport with field/constraint/value
  • Policy deny → correct 403 + decision log entry
  • Handler exception → correct 500 + trace ID; no partial writes visible
  • Budget exceeded → correct 429 + BudgetStatus
  • Successful invocation → response envelope with payload, provenance IRI, cost, consent receipt, trace ID
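
The outcome-to-status expectations above can be captured in one lookup that an integration test asserts against. This is a sketch under assumptions: the outcome labels, the envelope key names, and the 200 status for success are illustrative choices, since the plan fixes only the 400, 403, 500, and 429 codes and names the envelope contents in prose.

```python
from http import HTTPStatus

# Status codes for failures come from the list above; 200 for success is assumed.
OUTCOME_STATUS = {
    "validation_failed": HTTPStatus.BAD_REQUEST,
    "denied": HTTPStatus.FORBIDDEN,
    "handler_error": HTTPStatus.INTERNAL_SERVER_ERROR,
    "budget_exceeded": HTTPStatus.TOO_MANY_REQUESTS,
    "success": HTTPStatus.OK,
}

# Hypothetical key names for the documented envelope contents.
REQUIRED_ENVELOPE_KEYS = {"payload", "provenance", "cost", "consent_receipt", "trace_id"}

def check_response(outcome, status, body):
    """Return a list of problems for one observed response (empty means it conforms)."""
    problems = []
    expected = OUTCOME_STATUS[outcome]
    if status != expected:
        problems.append(f"expected {expected.value}, got {status}")
    if outcome == "success":
        missing = REQUIRED_ENVELOPE_KEYS - body.keys()
        if missing:
            problems.append(f"missing envelope keys: {sorted(missing)}")
    return problems
```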

3.2 MCP integration

  • Fresh Trails app + bundled MCP SSE server
  • External MCP client (reference: Claude Code CLI in test mode) connects
  • tools/list returns correct tool schemas
  • tools/call invokes capabilities with correct argument binding
  • Error mapping (MCP errors ↔ framework errors) correct

3.3 HTTP integration

  • FastAPI mount serves OpenAPI at /openapi.json
  • Content negotiation: Accept: text/markdown → Markdown; application/ld+json → JSON-LD; application/json → plain JSON
  • Capability manifest at /.well-known/capabilities is valid JSON-LD with correct context
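
For the content-negotiation case, a test needs a reference for which representation each Accept header should select. The sketch below is one plausible reading, not the adapter's actual logic: `negotiate` is a hypothetical helper, wildcards are ignored, and falling back to plain JSON for unsupported types is an assumption (a real server might answer 406 instead).

```python
SUPPORTED = ["text/markdown", "application/ld+json", "application/json"]

def negotiate(accept_header, default="application/json"):
    """Pick the supported media type with the highest q-value; ties keep header order."""
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        media, *params = [piece.strip() for piece in part.split(";")]
        quality = 1.0
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0   # malformed q-value: treat as not acceptable
        if media in SUPPORTED and quality > 0:
            # Negated quality so that min() prefers the highest q, then earliest position.
            candidates.append((-quality, position, media))
    return min(candidates)[2] if candidates else default
```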

3.4 Graph backend integration

  • Oxigraph embedded: CRUD, named graphs, transactions
  • Qlever remote: CRUD (write via SPARQL UPDATE), large-result streaming
  • Fuseki: CRUD, federated query
  • Switching backend in trails.toml produces identical application behavior for a fixed test suite (the "backend conformance suite" — see Band 4)

3.5 Identity + policy integration

  • Anonymous request → policy deny (unless explicitly permitted)
  • DID-signed biscuit → principal resolved, policy evaluated with correct principal context
  • VC-gated precondition → VC verified, preconditions pass/fail correctly
  • Biscuit attenuation: parent token authorizes subset; attenuated child cannot exceed
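
The attenuation case reduces to a subset property, sketched here with a toy token. `ToyToken` is hypothetical and models only a capability set; real biscuits express attenuation through appended blocks and checks, which this example does not attempt to reproduce.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyToken:
    """Stand-in for a biscuit: a principal plus a set of allowed capabilities."""
    principal: str
    capabilities: frozenset

    def attenuate(self, requested):
        # Intersection guarantees the child never gains a right the parent lacks.
        return ToyToken(self.principal, self.capabilities & frozenset(requested))

    def authorizes(self, capability):
        return capability in self.capabilities

parent = ToyToken("did:example:alice", frozenset({"patient.read", "patient.intake"}))
# The child asks for one right the parent has and one it does not.
child = parent.attenuate({"patient.read", "admin.delete"})
```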

3.6 Cost + budget integration

  • Per-capability envelope opened/closed
  • Budget-exhausted request rejected with 429 before handler runs
  • Anomaly detection fires when p99 latency exceeds configured threshold multiplier

3.7 Cross-subsystem integration (DispatchCoordinator)

  • DispatchCoordinator rolls back all graph writes when output validation fails after a successful handler body (NFR-Rel1).
  • Provenance is emitted on every deny branch (trails:outcome "denied" | "validation_failed" | "budget_exceeded" | "handler_error"), not only on success — verifies the always-on posture of ADR-0009 under policy-deny paths.
  • Cost envelope is closed with actual=estimate on any abort; no envelope leaks across a failed dispatch.
  • TOCTOU: a concurrent graph write mid-dispatch does not change the view seen by policy or the handler (snapshot isolation; NFR-Sec10).
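
The first three guarantees can be pictured with a toy dispatch loop. `ToyCoordinator` below is a hypothetical sketch, not the DispatchCoordinator itself: it copies the graph as a snapshot, restores it on any failure, records one provenance entry per dispatch, and closes the envelope on every path.

```python
class ToyCoordinator:
    """Toy dispatch loop: snapshot, run handler, validate output, roll back on failure."""
    def __init__(self):
        self.graph = set()
        self.provenance = []       # one record per dispatch, whatever the outcome
        self.open_envelopes = 0

    def dispatch(self, handler, validate_output, permitted=True):
        snapshot = set(self.graph)
        self.open_envelopes += 1
        outcome = "success"
        try:
            if not permitted:
                outcome = "denied"
                return None
            result = handler(self.graph)
            if not validate_output(result):
                outcome = "validation_failed"
                self.graph = snapshot      # discard the handler's writes
                return None
            return result
        except Exception:
            outcome = "handler_error"
            self.graph = snapshot
            return None
        finally:
            self.open_envelopes -= 1       # envelope closes on every path
            self.provenance.append({"outcome": outcome})

def write_and_return(graph):
    graph.add(("s", "p", "o"))
    return {"ok": False}

coordinator = ToyCoordinator()
coordinator.dispatch(write_and_return, lambda r: r["ok"])          # output invalid
coordinator.dispatch(write_and_return, lambda r: True, permitted=False)  # policy deny
```

After both calls the graph is empty, two provenance records exist (one per non-success outcome), and no envelope remains open.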

3.8 Graceful shutdown

  • On SIGTERM, in-flight handlers drain within the configured window (default 30 s); the process exits cleanly. New invocations started after SIGTERM are rejected with 503 (NFR-Rel4).

Tooling: pytest with testcontainers-python for Qlever/Fuseki/Postgres; cargo test with #[tokio::test] for Rust integration.


Band 4 — Conformance tests

Scope: "does this adapter / implementation meet the Trails spec?"

4.1 Graph backend conformance suite

A Python test suite (trails-conformance-graph) any GraphStore implementation must pass:

  • 50 SPARQL queries covering SELECT, CONSTRUCT, ASK, DESCRIBE, federated SERVICE
  • Named-graph write/read/delete
  • Transaction commit/rollback
  • Large-result streaming (≥ 100k rows)
  • Unicode + datatype edge cases (langString, xsd:dateTimeStamp, blank nodes)

All shipped adapters (Oxigraph, Qlever, Fuseki) must pass. Third-party adapters achieve "conformant" badge by passing.

4.2 Capability projection conformance

Given a fixed capability descriptor:

  • MCP projection passes MCP schema validator
  • OpenAPI projection passes OpenAPI 3.1 validator
  • JSON-LD canonical form passes JSON-LD 1.1 processor round-trip

4.3 Provenance conformance

Generated PROV triples round-trip through:

  • Apache Jena's PROV validator
  • pyprov / provpy round-trip
  • SPARQL queries matching standard PROV-O patterns produce expected results

4.4 Policy engine conformance

Given a reference Cedar policy set, decisions match the Cedar reference implementation byte-for-byte, including diagnostic output.

4.5 ACT / ECT conformance

Normative behaviour per draft-nennemann-act-00 (ACT) and draft-nennemann-wimse-ect (ECT):

  • Tokens produced by trails-identity validate under the published ACT / ECT draft test vectors (JSON samples committed under tests/conformance/act-ect/).
  • Replay protection — a previously-seen jti (within TTL) is rejected; the rejection path is logged and counted (NFR-Sec5).
  • Algorithm downgrade — tokens with alg=none, HS*, or any algorithm outside {EdDSA, ES256, ES384} are rejected at parse (NFR-Sec6).
  • L1/L2/L3 emission: @capability(assurance=...) drives the ECT assurance level; L1 is unsigned JSON, L2 is JOSE-signed, L3 is JOSE-signed + external-ledger anchored. Round-trip via pyprov and a reference ECT verifier.
  • L1 export containment (NFR-Sec13) — an L1 ECT export to an external sink fails unless the operator opt-in flag is set.
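
The replay and algorithm-downgrade checks could be exercised against logic shaped like the sketch below. It is illustrative only: `ReplayCache` and `accept_token` are hypothetical names, signature verification is omitted entirely, and the clock is passed in so tests can control TTL expiry. The algorithm allowlist is the one stated above.

```python
ALLOWED_ALGS = {"EdDSA", "ES256", "ES384"}   # allowlist from the plan

class ReplayCache:
    """Remembers each jti until its TTL elapses; the clock is injected for testing."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}

    def check_and_record(self, jti, now):
        # Drop expired entries so the cache does not grow without bound.
        self.seen = {j: t for j, t in self.seen.items() if now - t < self.ttl}
        if jti in self.seen:
            return False
        self.seen[jti] = now
        return True

def accept_token(header, claims, cache, now):
    """Reject on algorithm first (at parse), then on replay; no signature check here."""
    if header.get("alg") not in ALLOWED_ALGS:
        return "rejected: algorithm"
    if not cache.check_and_record(claims["jti"], now):
        return "rejected: replay"
    return "accepted"

cache = ReplayCache(ttl_seconds=60)
```

Checking the algorithm before touching the replay cache means a downgraded token never consumes a `jti` slot.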

4.6 ABI3 wheel conformance

The compiled Rust wheel imports under Python 3.11, 3.12, 3.13 with the same binary; trails_core smoke tests pass on each.

4.7 License gate

cargo deny check licenses and pip-licenses --fail-on GPL must both be green. Required for NFR-Lic1.


Band 5 — Agent-simulation tests

Scope: nondeterministic but shape-pinned. The only band that uses LLMs.

5.1 trails sim

Local agent (cheap model, Haiku or equivalent) configured with:

  • The app's MCP tool list
  • Random plausible inputs generated from shape schemas
  • Budget cap per simulation run (default $0.50)

Invokes capabilities 100+ times, checking:

  • Shape pinning: every response matches declared output shape
  • Provenance presence: every response includes a resolvable provenance IRI
  • Cost envelope accuracy: actual costs within 3x of estimates
  • Policy coverage: every policy branch exercised at least once
  • No partial writes: failed invocations leave graph untouched
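
A per-run summary of the first three checks might look like the sketch below. The record format and `check_sim_run` are hypothetical, and "within 3x" is read here as two-sided (actual cost between one third and three times the estimate), which is an interpretation rather than something the plan states.

```python
def check_sim_run(invocations, factor=3.0):
    """Collect (index, check_name) pairs for every violated check in a sim run."""
    problems = []
    for i, inv in enumerate(invocations):
        if not inv["shape_ok"]:
            problems.append((i, "shape"))
        if not inv["provenance_iri"]:
            problems.append((i, "provenance"))
        est, actual = inv["estimated_cost"], inv["actual_cost"]
        # Two-sided tolerance: neither wildly over nor wildly under the estimate.
        if not (est / factor <= actual <= est * factor):
            problems.append((i, "cost"))
    return problems

sample_run = [
    {"shape_ok": True, "provenance_iri": "urn:prov:1",
     "estimated_cost": 1.0, "actual_cost": 2.5},
    {"shape_ok": True, "provenance_iri": "",
     "estimated_cost": 1.0, "actual_cost": 4.0},
]
```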

5.2 Golden-shape assertions

Instead of string-equality assertions that fail on paraphrase:

from trails.testing import assert_shape, assert_provenance_chain

result = capability.invoke("patient.intake", valid_input)

assert_shape(result.payload, Patient)           # SHACL-validated
assert_provenance_chain(
    result.provenance,
    expected_activities=["patient.intake"],
    expected_agents=[test_principal.did],
)

5.3 Testing-helper API surface (v1)

Canonical helpers under trails.testing that example code and §5.2 reference:

  • trails.testing.assert_shape(obj, Shape) — SHACL-pinned assertion (cf. §3.9 of 03-design-spec).
  • trails.testing.assert_provenance_chain(prov_iri, expected_activities=[...], expected_agents=[...]) — verify PROV-O chain.
  • trails.testing.fake_act(principal, capabilities, vc=None) — issue a test ACT mandate for use in tests.
  • trails.testing.fake_commit(author_did, parents=[], diff="...") — construct a test-fixture Commit entity (analogous constructors exist for other shapes).
  • trails.testing.AuthorizationError — raised when a capability is invoked without a valid ACT/policy.

5.4 Adversarial simulation

Agent is given "red team" prompt: try to bypass preconditions, exhaust budgets, generate malformed inputs. Framework must:

  • Reject all bypass attempts via policy / validation
  • Limit damage to budget caps
  • Log every attempt in decision log for review

Tooling: trails sim (author-built), Anthropic SDK with cached prompts, golden-shape assertion library.


Cross-cutting test concerns

Coverage targets (summary)

| Layer | Unit | Integration | Notes |
|---|---|---|---|
| Rust kernel | ≥ 90% | n/a | cargo tarpaulin |
| Python surface | ≥ 85% | n/a | pytest-cov |
| E2E flows | n/a | 100% of documented capabilities | required before release |
| Policy decisions | 100% branch coverage | via agent-sim | Cedar supports branch tracing |

Performance regression tests

Run in CI on every PR:

  • Capability dispatch overhead ≤ 10 ms p95 (NFR-Perf1)
  • 100-triple write ≤ 20 ms p95 (NFR-Perf2)
  • tools/list ≤ 50 ms (NFR-Perf4)
  • Regression threshold: 20% slower fails the build
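
The gate combines absolute budgets with a relative threshold, which a small comparison function can express. This sketch makes two assumptions: "20% slower" is measured against a stored baseline, and the benchmark names are invented for illustration. The budget values are the ones listed above.

```python
def regression_failures(baseline_ms, current_ms, budgets_ms, max_slowdown=0.20):
    """Return benchmarks that break an absolute budget or regress by more than 20%."""
    failures = []
    for name, current in current_ms.items():
        if current > budgets_ms[name]:
            failures.append((name, "over budget"))
        elif current > baseline_ms[name] * (1 + max_slowdown):
            failures.append((name, "regression"))
    return failures

# Budgets from the list above; benchmark names and timings are hypothetical.
BUDGETS = {"dispatch_p95": 10.0, "write_100_p95": 20.0, "tools_list": 50.0}
baseline = {"dispatch_p95": 4.0, "write_100_p95": 12.0, "tools_list": 30.0}
current = {"dispatch_p95": 5.0, "write_100_p95": 13.0, "tools_list": 55.0}
```

In this example `dispatch_p95` is still under its 10 ms budget but 25% slower than baseline, so it fails as a regression, while `tools_list` fails outright on the absolute budget.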

Tooling: criterion (Rust), pytest-benchmark (Python), CI workflow publishes flamegraphs on regression.

Security test scenarios

Run in CI weekly:

  • SPARQL injection: malicious input strings attempting to escape parameterization; framework must sanitize.
  • Biscuit forgery: tampered tokens; kernel must reject.
  • DID spoofing: synthetic DID documents; framework must verify against pinned trust roots.
  • Policy bypass: attempt to invoke capability without PEP check (shouldn't be possible by construction; verified via fuzzer).
  • Validator bypass: malformed RDF attempting to reach graph unvalidated.

Chaos / fault injection (v1+)

  • Kill Oxigraph mid-transaction — verify no partial writes
  • Network partition between Python surface and remote Qlever — verify graceful degradation
  • OOM simulation — verify cost envelopes limit damage

CI matrix

Required cells (must be green for release; matches NFR-Port1 P0):

| Axis | Values |
|---|---|
| OS / arch | ubuntu-22.04 (x86_64), ubuntu-22.04-arm64 (aarch64), macos-13 (x86_64), macos-14 (aarch64); windows-2022 (x86_64) advisory-only |
| Python | 3.11, 3.12, 3.13 |
| Rust | stable, beta, MSRV (fixed) |
| Graph backend | Oxigraph-embedded, Qlever (testcontainer), Fuseki (testcontainer) |

All four Linux + macOS arch cells are P0: the NFR-Port1 release gate fails if any of the four is red. windows-2022 is P1 (NFR-Port2).

CI runs on GitHub Actions. Per-PR: unit + integration + conformance on the P0 set. Nightly: agent-sim, security, performance regression, plus the full matrix including advisory cells.

Test data management

  • Fixtures: ontology bundles under tests/fixtures/ontologies/, versioned.
  • Golden files: PROV subgraphs under tests/golden/, reviewed on change.
  • Secrets: never in fixtures; test DIDs/biscuits generated per-run.
  • Determinism: any LLM-touching test is either agent-sim (nondeterministic, shape-pinned) or replays cached responses.

| Tool | Band | Purpose |
|---|---|---|
| trails-conformance-graph | 4 | Backend conformance suite |
| trails-conformance-caps | 4 | Capability projection conformance |
| trails sim | 5 | Agent-based testing |
| trails.testing.assert_shape | 5 | Shape-pinned assertions |
| trails.testing.assert_provenance_chain | 5 | Provenance assertions |
| Cedar policy test harness | 4 | .cedar-test file format + runner |
| Golden-shape fixture generator | 5 | Captures example responses per capability |

Release gates

Before any tagged release:

  1. Band 1–3 must pass on all CI matrix cells.
  2. Band 4 conformance must pass for all shipped backends.
  3. Band 5 agent-sim must run clean on all example apps for ≥ 100 iterations.
  4. Performance regression budget not exceeded.
  5. Security scenarios re-run manually; findings triaged.
  6. At least one example app redeployed with the new version.

Release gates by milestone

Tight CI min-sets per milestone.

M0 min-set (5 items)

| # | Gate | Source |
|---|---|---|
| M0-1 | examples/hello.py boots + one successful MCP tools/call | Band 3.2 subset |
| M0-2 | trails-graph::test_named_graph_isolation + test_snapshot_rollback pass | Band 1.1 |
| M0-3 | PyO3 FFI round-trip bench recorded to bench/m0-baseline.json | Band 1.3 |
| M0-4 | cargo test + pytest examples/ green on ubuntu-22.04-x86_64 and macos-14-aarch64 | CI matrix |
| M0-5 | License-header gate green (cargo-deny, pip-licenses) | Band 4.7 |

M1 min-set (9 items)

| # | Gate | Source |
|---|---|---|
| M1-1 | All M0 gates still green | |
| M1-2 | Band 1.1 + 1.2 coverage targets for trails-shapes, trails-prov, trails-caps, trails-graph | Band 1 |
| M1-3 | Band 2 §2.1, §2.2, §2.4, §2.5 pass | Band 2 |
| M1-4 | Band 3.1 + 3.2 pass; Band 3.3 covers the minimal HTTP adapter subset | Band 3 |
| M1-5 | Band 4.1 Oxigraph conformance passing | Band 4 |
| M1-6 | Band 4.3 PROV conformance passing | Band 4 |
| M1-7 | NFR-Erg1 LLOC gate on examples/hello.py (radon raw ≤ 10 + MCP probe) | Band 3.7-adjacent |
| M1-8 | NFR-Perf1 + NFR-Perf2 thresholds enforced on ubuntu-22.04-x86_64 | Performance regression |
| M1-9 | Dogfood service smoke suite reports green Trails invocation | cross-repo CI |