06 — Test Plan

Test pyramid (adjusted for agentic-semantic)

                     ▲
                    / \
                   / E \                 ← Agent-sim (nondeterministic, golden-shape)
                  /─────\
                 /  C    \               ← Conformance (cross-backend, cross-runtime)
                /─────────\
               /     I     \             ← Integration (process-level, real backends)
              /─────────────\
             /   Properties   \          ← Property (QuickCheck-style invariants)
            /─────────────────\
           /       Unit         \        ← Unit (fast, pure, per-module)
          /─────────────────────\

Five bands, each with distinct failure modes and tools.

Band 1 — Unit tests

Scope: per-function / per-trait-method. Pure where possible.

1.1 Rust kernel unit tests

Crate	Coverage target	Example
`trails-graph`	≥ 90%	`test_named_graph_isolation`, `test_snapshot_rollback`
`trails-shapes`	≥ 95%	`test_cardinality_violation`, `test_nested_shape`
`trails-reason`	≥ 80%	`test_rdfs_subclass_entailment`, `test_inferred_cache_invalidation`
`trails-policy`	≥ 95%	`test_permit_on_valid_vc`, `test_deny_with_diagnostics`
`trails-prov`	≥ 90%	`test_prov_activity_chain`, `test_derivation_link`, `test_ect_emission_L1`, `test_ect_emission_L2_signed`, `test_ect_emission_L3_anchored`, `test_prov_ect_roundtrip`
`trails-caps`	≥ 90%	`test_mcp_projection`, `test_openapi_projection`, `test_jsonld_canonical`
`trails-identity`	≥ 90%	`test_did_key_resolve`, `test_did_web_resolve_tls_pinned`, `test_act_verify`, `test_act_replay_rejected`, `test_act_attenuation_to_biscuit`, `test_alg_none_rejected`
`trails-cost`	≥ 95%	`test_budget_exceeded`, `test_envelope_close_actual`

Tooling: cargo test, cargo tarpaulin for coverage, proptest for property tests (see Band 2).

1.2 Python surface unit tests

Module	Coverage target	Example
`decorators.py`	≥ 90%	`test_capability_registration`, `test_shape_field_mapping`
`mcp_server.py`	≥ 80%	`test_tools_list`, `test_tools_call_dispatch`
`http_adapter.py`	≥ 80%	`test_content_negotiation`, `test_openapi_generation`
`render.py`	≥ 90%	`test_bimodal_markdown`, `test_bimodal_jsonld`
`cli/*`	≥ 70%	`test_new_scaffold`, `test_onto_export`, `test_trails_g_resource`, `test_trails_dev_autoreload`
`testing.py`	≥ 80%	`test_assert_shape`, `test_assert_provenance_chain`, `test_fake_act`

Tooling: pytest, pytest-cov, hypothesis for property tests.

1.3 FFI boundary tests

Guarantees across the PyO3 edge:

Panic containment — injected panic! at each kernel entry point (trails_core.graph, trails_core.shapes, trails_core.policy, trails_core.prov, trails_core.caps, trails_core.cost, trails_core.identity) surfaces as a Python KernelError; the Python process does not abort (NFR-Sec14).
Structured error preservation — every Rust TrailsError variant maps to the documented Python exception type (ValidationError, AuthenticationError, AuthorizationError, PreconditionError, BudgetExceededError, HandlerError, BackendError, KernelError) with .field, .constraint, .iri context fields populated.
ABI3 wheel — the same binary imports under Python 3.11, 3.12, 3.13 (covered under Band 4 ABI3 conformance).
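
The structured-error guarantee above could be asserted along the following lines. This is a minimal sketch: the exception classes are stand-ins for the real trails exception types, and `assert_error_context` is a hypothetical helper rather than part of the documented API.

```python
class TrailsError(Exception):
    """Stand-in base class; the real hierarchy lives in the trails package."""

    def __init__(self, message, *, field=None, constraint=None, iri=None):
        super().__init__(message)
        self.field = field
        self.constraint = constraint
        self.iri = iri


class ValidationError(TrailsError):
    pass


class KernelError(TrailsError):
    pass


def assert_error_context(exc, expected_type):
    """Check the exception type and that all three context fields are populated."""
    assert isinstance(exc, expected_type), type(exc).__name__
    for name in ("field", "constraint", "iri"):
        assert getattr(exc, name, None) is not None, f"missing .{name}"


# Illustrative values only; the IRI is a placeholder.
err = ValidationError(
    "cardinality violated",
    field="name",
    constraint="sh:minCount",
    iri="https://example.org/shapes#Patient",
)
assert_error_context(err, ValidationError)
```

A real test would trigger each Rust error variant through the PyO3 boundary instead of constructing the exception by hand.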

Band 2 — Property tests

Scope: invariants over random inputs. Catches edge cases unit tests miss.

2.1 Graph invariants

Idempotent load: load(g, d) then load(g, d) yields identical graph state.
Transaction atomicity: on rollback, no triple from the aborted transaction is visible.
Named-graph isolation: triples in graph A never appear in queries restricted to graph B.
SPARQL round-trip: CONSTRUCT of a graph, re-INSERTed into an empty store, equals the original (modulo blank node renaming).
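
As an illustration of the first and third invariants, here is a sketch of a seeded property check. It uses the standard-library `random` module in place of hypothesis or proptest so that it runs without dependencies, and `ToyStore` is a hypothetical in-memory stand-in, not the real GraphStore.

```python
import random


class ToyStore:
    """In-memory quad store used only to illustrate the invariants."""

    def __init__(self):
        self.quads = set()

    def load(self, graph, triples):
        self.quads |= {(s, p, o, graph) for (s, p, o) in triples}

    def query(self, graph):
        return {(s, p, o) for (s, p, o, g) in self.quads if g == graph}


def random_triples(rng, n):
    return [
        (f"s{rng.randrange(5)}", f"p{rng.randrange(3)}", f"o{rng.randrange(5)}")
        for _ in range(n)
    ]


def check_graph_invariants(seed):
    rng = random.Random(seed)
    data_a = random_triples(rng, rng.randrange(1, 20))
    data_b = random_triples(rng, rng.randrange(1, 20))

    store = ToyStore()
    store.load("A", data_a)
    once = set(store.quads)
    store.load("A", data_a)
    assert store.quads == once  # idempotent load

    store.load("B", data_b)
    # Isolation: a query restricted to one graph sees only that graph's triples.
    assert store.query("A") == set(data_a)
    assert store.query("B") == set(data_b)


for seed in range(200):
    check_graph_invariants(seed)
```

Seeding each case makes a failing input reproducible, which a property-testing library would otherwise handle through shrinking and replay.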

2.2 Shape invariants

Monotonicity: relaxing a shape (replacing it with a more permissive one) cannot cause previously valid data to fail.
Composition: validate(data, [A, B]) ≡ validate(data, A) ∧ validate(data, B).
Cardinality: data with between n and m occurrences (min=n, max=m) triggers no cardinality violation; data with n-1 or m+1 occurrences triggers a violation with the correct diagnostic.
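
The cardinality invariant lends itself to a boundary check. The sketch below assumes a hypothetical `cardinality_violations` function and an invented diagnostic format; the real validator reports SHACL results.

```python
def cardinality_violations(count, min_count, max_count):
    """Return a list of diagnostics for an observed value count."""
    problems = []
    if count < min_count:
        problems.append(f"minCount {min_count} violated: found {count}")
    if count > max_count:
        problems.append(f"maxCount {max_count} violated: found {count}")
    return problems


def check_cardinality_bounds(min_count, max_count):
    # Every count inside [min, max] must be clean.
    for count in range(min_count, max_count + 1):
        assert cardinality_violations(count, min_count, max_count) == []
    # One below the minimum (when that is possible) must report minCount.
    if min_count > 0:
        below = cardinality_violations(min_count - 1, min_count, max_count)
        assert below == [f"minCount {min_count} violated: found {min_count - 1}"]
    # One above the maximum must report maxCount.
    above = cardinality_violations(max_count + 1, min_count, max_count)
    assert above == [f"maxCount {max_count} violated: found {max_count + 1}"]


for n in range(0, 4):
    for m in range(n, n + 4):
        check_cardinality_bounds(n, m)
```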

2.3 Policy invariants

Determinism: same request → same decision, always.
Explicit-deny wins: presence of an explicit forbid always overrides any permit.
Context-freeness: decisions depend only on declared inputs (principal, action, resource, context) — no hidden state.
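
Determinism and explicit-deny-wins can be exercised against random rule sets. The evaluator below is a toy written for this sketch, not Cedar; rule tuples and the principal and action names are illustrative assumptions.

```python
import random

ACTIONS = ["read", "write", "delete"]
PRINCIPALS = ["alice", "bob"]


def decide(rules, principal, action):
    """Deny by default; any matching forbid overrides every permit."""
    matching = [effect for (effect, p, a) in rules if p == principal and a == action]
    if "forbid" in matching:
        return "deny"
    return "permit" if "permit" in matching else "deny"


def check_policy_invariants(seed):
    rng = random.Random(seed)
    rules = [
        (rng.choice(["permit", "forbid"]), rng.choice(PRINCIPALS), rng.choice(ACTIONS))
        for _ in range(rng.randrange(0, 8))
    ]
    principal, action = rng.choice(PRINCIPALS), rng.choice(ACTIONS)

    # Determinism: the same request against the same rules gives the same answer.
    first = decide(rules, principal, action)
    assert decide(list(rules), principal, action) == first

    # Explicit deny wins: adding a matching forbid always yields deny,
    # regardless of where the rule ends up in the list.
    with_forbid = rules + [("forbid", principal, action)]
    rng.shuffle(with_forbid)
    assert decide(with_forbid, principal, action) == "deny"


for seed in range(500):
    check_policy_invariants(seed)
```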

2.4 Provenance invariants

Conservation: every generated entity is traceable back through used entities to original inputs (or declared external sources).
Monotonicity: PROV triples are append-only; no existing activity record is ever mutated.
Signed integrity (when enabled): tampered PROV records fail signature verification.
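
The conservation invariant amounts to a reachability check over generation and usage edges. A minimal sketch follows, with plain dictionaries standing in for PROV-O triples; the function name and the strict treatment of input-less activities are assumptions made for illustration.

```python
def untraceable_entities(generated_by, used, roots):
    """Return entities that cannot be traced back to a declared root.

    generated_by: entity -> activity that generated it
    used:         activity -> list of entities it consumed
    roots:        original inputs or declared external sources
    """

    def traceable(entity, path):
        if entity in roots:
            return True
        if entity in path or entity not in generated_by:
            return False  # a cycle, or an entity with no recorded origin
        inputs = used.get(generated_by[entity], [])
        # Strict reading (an assumption): an activity that used nothing
        # does not ground the entities it generated.
        return bool(inputs) and all(traceable(e, path | {entity}) for e in inputs)

    return sorted(e for e in generated_by if not traceable(e, frozenset()))


generated_by = {"report": "summarise", "cleaned": "clean", "orphan": "mystery"}
used = {"summarise": ["cleaned"], "clean": ["raw"], "mystery": ["ghost"]}
broken = untraceable_entities(generated_by, used, roots={"raw"})
```

Here `report` traces through `cleaned` to the root `raw`, while `orphan` depends on `ghost`, which is neither a root nor a generated entity, so it is reported.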

2.5 Capability invariants

Manifest consistency: MCP projection, JSON-LD canonical, and OpenAPI projection all describe the same input/output shape.
Version coherence: capability.version is SemVer; deprecates always references a valid prior version.
Idempotency: capabilities declared idempotent=True produce identical observable state on repeated invocation with the same idempotency key.
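
Version coherence can be checked mechanically. The sketch below validates only the MAJOR.MINOR.PATCH core of SemVer (pre-release and build metadata are deliberately ignored), and it assumes capability descriptors are dictionaries with `name`, `version`, and an optional `deprecates` key, which is an illustrative layout rather than the real manifest schema.

```python
import re

# Core MAJOR.MINOR.PATCH only; pre-release and build metadata are ignored here.
SEMVER_CORE = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


def parse_version(text):
    match = SEMVER_CORE.match(text)
    if match is None:
        raise ValueError(f"not a SemVer core version: {text!r}")
    return tuple(int(part) for part in match.groups())


def version_problems(capabilities):
    """Report deprecates links that point at unknown or non-prior versions."""
    known = {(c["name"], c["version"]) for c in capabilities}
    problems = []
    for cap in capabilities:
        current = parse_version(cap["version"])
        old = cap.get("deprecates")
        if old is None:
            continue
        if (cap["name"], old) not in known:
            problems.append(f"{cap['name']} {cap['version']}: deprecates unknown {old}")
        elif parse_version(old) >= current:
            problems.append(f"{cap['name']} {cap['version']}: {old} is not a prior version")
    return problems


caps = [
    {"name": "patient.intake", "version": "1.0.0"},
    {"name": "patient.intake", "version": "1.1.0", "deprecates": "1.0.0"},
    {"name": "patient.intake", "version": "1.2.0", "deprecates": "0.9.0"},
]
problems = version_problems(caps)
```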

Band 3 — Integration tests

Scope: multiple subsystems, real backends (no mocks). Process-level.

3.1 End-to-end capability invocation

Python @capability → PyO3 → Rust kernel → Oxigraph → response envelope
Validation failure at input → correct 400 + ValidationReport with field/constraint/value
Policy deny → correct 403 + decision log entry
Handler exception → correct 500 + trace ID; no partial writes visible
Budget exceeded → correct 429 + BudgetStatus
Successful invocation → response envelope with payload, provenance IRI, cost, consent receipt, trace ID
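
The expectations above fit a table-driven check. In this sketch the status codes come from the list, while the outcome labels and the envelope key names (`consent_receipt`, `trace_id`, and so on) are assumptions chosen for illustration.

```python
# Expected HTTP status per failure mode, taken from the list above.
EXPECTED_STATUS = {
    "validation_failed": 400,
    "policy_denied": 403,
    "handler_error": 500,
    "budget_exceeded": 429,
}

# Hypothetical key names for the success envelope fields.
SUCCESS_KEYS = {"payload", "provenance", "cost", "consent_receipt", "trace_id"}


def response_problems(outcome, status, body):
    """Return a list of mismatches between a response and the plan."""
    problems = []
    if outcome == "success":
        missing = sorted(SUCCESS_KEYS - set(body))
        if missing:
            problems.append(f"envelope missing: {', '.join(missing)}")
    else:
        expected = EXPECTED_STATUS[outcome]
        if status != expected:
            problems.append(f"{outcome}: expected {expected}, got {status}")
    return problems


ok = response_problems("policy_denied", 403, {})
wrong = response_problems("budget_exceeded", 500, {})
partial = response_problems("success", 200, {"payload": {}, "trace_id": "t-1"})
```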

3.2 MCP integration

Fresh Trails app + bundled MCP SSE server
External MCP client (reference: Claude Code CLI in test mode) connects
tools/list returns correct tool schemas
tools/call invokes capabilities with correct argument binding
Error mapping (MCP errors ↔ framework errors) correct

3.3 HTTP integration

FastAPI mount serves OpenAPI at /openapi.json
Content negotiation: Accept: text/markdown → Markdown; application/ld+json → JSON-LD; application/json → plain JSON
Capability manifest at /.well-known/capabilities is valid JSON-LD with correct context
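
For the content-negotiation case, a test oracle might look like the sketch below. It is a simplified Accept parser written for illustration, not the adapter's implementation; the renderer names and the choice of plain JSON as the default for `*/*` or an empty header are assumptions.

```python
RENDERERS = {
    "text/markdown": "markdown",
    "application/ld+json": "jsonld",
    "application/json": "json",
}
DEFAULT_MEDIA_TYPE = "application/json"  # assumption: used for */* or a missing header


def acceptable_types(header):
    """Media types from an Accept header, best first; q=0 entries are dropped."""
    entries = []
    for index, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            entries.append((-quality, index, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def select_renderer(header):
    """Return the renderer name, or None when nothing acceptable is available."""
    if not header.strip():
        return RENDERERS[DEFAULT_MEDIA_TYPE]
    for media_type in acceptable_types(header):
        if media_type == "*/*":
            return RENDERERS[DEFAULT_MEDIA_TYPE]
        if media_type in RENDERERS:
            return RENDERERS[media_type]
    return None
```

A `None` result corresponds to the case where the server would have to answer 406 or fall back to a default, a decision this plan does not specify.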

3.4 Graph backend integration

Oxigraph embedded: CRUD, named graphs, transactions
Qlever remote: CRUD (write via SPARQL UPDATE), large-result streaming
Fuseki: CRUD, federated query
Switching backend in trails.toml produces identical application behavior for a fixed test suite (the "backend conformance suite" — see Band 4)

3.5 Identity + policy integration

Anonymous request → policy deny (unless explicitly permitted)
DID-signed biscuit → principal resolved, policy evaluated with correct principal context
VC-gated precondition → VC verified, preconditions pass/fail correctly
Biscuit attenuation: parent token authorizes subset; attenuated child cannot exceed
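
The attenuation rule reduces to a subset property. The sketch below models it with a toy token class; `ToyToken` is not the Biscuit API, and apart from `patient.intake` the capability names are invented for the example.

```python
import random


class ToyToken:
    """Stand-in for a capability token; not the Biscuit API."""

    def __init__(self, capabilities):
        self.capabilities = frozenset(capabilities)

    def attenuate(self, requested):
        # A child may only keep rights the parent already holds.
        return ToyToken(self.capabilities & frozenset(requested))

    def authorizes(self, capability):
        return capability in self.capabilities


ALL_CAPS = ["patient.intake", "patient.read", "patient.delete", "audit.read"]


def check_attenuation(seed):
    rng = random.Random(seed)
    parent = ToyToken(rng.sample(ALL_CAPS, rng.randrange(0, len(ALL_CAPS) + 1)))
    child = parent.attenuate(rng.sample(ALL_CAPS, rng.randrange(0, len(ALL_CAPS) + 1)))
    for cap in ALL_CAPS:
        # Anything the child can do, the parent could already do.
        assert not child.authorizes(cap) or parent.authorizes(cap)


for seed in range(300):
    check_attenuation(seed)
```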

3.6 Cost + budget integration

Per-capability envelope opened/closed
Budget-exhausted request rejected with 429 before handler runs
Anomaly detection fires when p99 latency exceeds configured threshold multiplier
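
The second item, rejection before the handler runs, can be verified by recording handler calls. This is a sketch under assumptions: `Budget`, `dispatch`, and the pre-check against an estimate are hypothetical simplifications of the real cost envelope.

```python
class BudgetExceededError(Exception):
    """Stand-in for the framework exception of the same name."""


class Budget:
    def __init__(self, limit):
        self.limit = limit
        self.spent = 0.0


def dispatch(budget, estimate, handler):
    """Reject before running the handler if the estimate would exceed the budget."""
    if budget.spent + estimate > budget.limit:
        raise BudgetExceededError(
            f"spent={budget.spent}, estimate={estimate}, limit={budget.limit}"
        )
    result, actual_cost = handler()
    budget.spent += actual_cost
    return result


calls = []


def handler():
    calls.append("ran")  # lets the test prove whether the handler executed
    return "ok", 0.4


budget = Budget(limit=0.5)
first = dispatch(budget, estimate=0.4, handler=handler)

try:
    dispatch(budget, estimate=0.4, handler=handler)
    rejected = False
except BudgetExceededError:
    rejected = True
```

After the second call, `calls` still holds a single entry, which is the evidence that the rejection happened ahead of the handler.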

3.7 Cross-subsystem integration (DispatchCoordinator)

DispatchCoordinator rolls back all graph writes when output validation fails after a successful handler body (NFR-Rel1).
Provenance is emitted on every deny branch (trails:outcome "denied" | "validation_failed" | "budget_exceeded" | "handler_error"), not only on success — verifies the always-on posture of ADR-0009 under policy-deny paths.
Cost envelope is closed with actual=estimate on any abort; no envelope leaks across a failed dispatch.
TOCTOU: a concurrent graph write mid-dispatch does not change the view seen by policy or the handler (snapshot isolation; NFR-Sec10).
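
The first two guarantees can be pictured with a toy coordinator that stages writes and records an outcome on every branch. The function below is a hypothetical sketch, not the DispatchCoordinator itself: a copied set stands in for a transaction, a list stands in for the provenance log, and the `"success"` label is an assumption.

```python
def coordinate(store, prov_log, handler, validate_output):
    """Run the handler on a staged copy; commit only if the output validates."""
    staged = set(store)  # stand-in for a transaction over the graph
    try:
        result = handler(staged)
    except Exception:
        prov_log.append("handler_error")  # provenance on the failure branch
        return None
    if not validate_output(result):
        prov_log.append("validation_failed")  # staged writes are discarded
        return None
    store.clear()
    store.update(staged)  # commit
    prov_log.append("success")
    return result


store = {("s", "p", "o")}
prov_log = []


def writing_handler(staged):
    staged.add(("s", "p", "new"))
    return {"name": None}  # deliberately invalid output


result = coordinate(store, prov_log, writing_handler, lambda out: out["name"] is not None)
```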

3.8 Graceful shutdown

On SIGTERM, in-flight handlers drain within the configured window (default 30 s); the process exits cleanly. New invocations started after SIGTERM are rejected with 503 (NFR-Rel4).
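
The drain behaviour can be sketched with asyncio. `Drainer` is a hypothetical helper written for this example; a real server would call `shutdown` from a SIGTERM handler and translate a rejected submission into a 503.

```python
import asyncio


class Drainer:
    """Tracks in-flight handlers and refuses new ones once shutdown begins."""

    def __init__(self):
        self.shutting_down = False
        self.in_flight = set()

    def submit(self, coro):
        if self.shutting_down:
            coro.close()  # avoid a "never awaited" warning for the rejected call
            return None   # the caller would map this to a 503
        task = asyncio.ensure_future(coro)
        self.in_flight.add(task)
        task.add_done_callback(self.in_flight.discard)
        return task

    async def shutdown(self, window):
        """Stop accepting work, then wait up to `window` seconds for handlers."""
        self.shutting_down = True
        if not self.in_flight:
            return True
        _, pending = await asyncio.wait(set(self.in_flight), timeout=window)
        for task in pending:
            task.cancel()  # the drain window elapsed
        return not pending


async def demo():
    drainer = Drainer()
    accepted = drainer.submit(asyncio.sleep(0.01, result="done"))
    drained = await drainer.shutdown(window=1.0)
    rejected = drainer.submit(asyncio.sleep(0))
    return accepted.result(), drained, rejected


outcome = asyncio.run(demo())
```

The demo uses a one-second window to keep the run short; the plan's default of 30 s would be passed as `window=30.0`.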

Tooling: pytest with testcontainers-python for Qlever/Fuseki/Postgres; cargo test with #[tokio::test] for Rust integration.

Band 4 — Conformance tests

Scope: "does this adapter / implementation meet the Trails spec?"

4.1 Graph backend conformance suite

A Python test suite (trails-conformance-graph) any GraphStore implementation must pass:

50 SPARQL queries covering SELECT, CONSTRUCT, ASK, DESCRIBE, federated SERVICE
Named-graph write/read/delete
Transaction commit/rollback
Large-result streaming (≥ 100k rows)
Unicode + datatype edge cases (langString, xsd:dateTimeStamp, blank nodes)

All shipped adapters (Oxigraph, Qlever, Fuseki) must pass. Third-party adapters earn the "conformant" badge by passing the same suite.

4.2 Capability projection conformance

Given a fixed capability descriptor:

MCP projection passes MCP schema validator
OpenAPI projection passes OpenAPI 3.1 validator
JSON-LD canonical form passes JSON-LD 1.1 processor round-trip

4.3 Provenance conformance

Generated PROV triples are checked in three ways:

Apache Jena's PROV validator
pyprov / provpy round-trip
SPARQL queries matching standard PROV-O patterns, which must produce the expected results

4.4 Policy engine conformance

Given a reference Cedar policy set, decisions match the Cedar reference implementation byte-for-byte, including diagnostic output.

4.5 ACT / ECT conformance

Normative behaviour per draft-nennemann-act-00 (ACT) and draft-nennemann-wimse-ect (ECT):

Tokens produced by trails-identity validate under the published ACT / ECT draft test vectors (JSON samples committed under tests/conformance/act-ect/).
Replay protection — a previously-seen jti (within TTL) is rejected; the rejection path is logged and counted (NFR-Sec5).
Algorithm downgrade — tokens with alg=none, HS*, or any algorithm outside {EdDSA, ES256, ES384} are rejected at parse (NFR-Sec6).
L1/L2/L3 emission — @capability(assurance=...) drives ECT assurance level; L1 is unsigned JSON, L2 is JOSE-signed, L3 is JOSE-signed + external-ledger anchored. Round-trip via pyprov and a reference ECT verifier.
L1 export containment (NFR-Sec13) — an L1 ECT export to an external sink fails unless the operator opt-in flag is set.
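
The replay and algorithm-downgrade checks can be sketched with the standard library alone. The code assumes tokens use JWS compact serialization (the plan describes them as JOSE-signed); `fake_token`, `reject_downgrade`, and `ReplayCache` are hypothetical names, and no signature is verified here.

```python
import base64
import json

ALLOWED_ALGS = {"EdDSA", "ES256", "ES384"}


def b64url_decode(segment):
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def reject_downgrade(token):
    """Read `alg` from the protected header and refuse anything off the allowlist."""
    header = json.loads(b64url_decode(token.split(".", 1)[0]))
    alg = header.get("alg")
    if alg not in ALLOWED_ALGS:
        raise ValueError(f"algorithm not allowed: {alg!r}")
    return alg


class ReplayCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # jti -> expiry time

    def check_and_store(self, jti, now):
        """Return True if fresh; False if this jti was already seen within TTL."""
        self.seen = {j: exp for j, exp in self.seen.items() if exp > now}
        if jti in self.seen:
            return False
        self.seen[jti] = now + self.ttl
        return True


def fake_token(alg):
    """Build an unsigned test token with the given header alg (test fixture only)."""
    raw = json.dumps({"alg": alg}).encode()
    header = base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{header}.e30.sig"


cache = ReplayCache(ttl_seconds=10)
fresh = cache.check_and_store("jti-1", now=0)
replayed = cache.check_and_store("jti-1", now=5)
after_ttl = cache.check_and_store("jti-1", now=11)
```

Passing `now` explicitly keeps the cache deterministic under test; production code would read a monotonic clock.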

4.6 ABI3 wheel conformance

The compiled Rust wheel imports under Python 3.11, 3.12, 3.13 with the same binary; trails_core smoke tests pass on each.

4.7 License gate

cargo-deny check licenses and pip-licenses --fail-on GPL must both be green. Required for NFR-Lic1.

Band 5 — Agent-simulation tests

Scope: nondeterministic but shape-pinned. The only band that uses LLMs.

5.1 `trails sim`

Local agent (cheap model, Haiku or equivalent) configured with:

The app's MCP tool list
Random plausible inputs generated from shape schemas
Budget cap per simulation run (default $0.50)

Invokes capabilities 100+ times, checking:

Shape pinning: every response matches declared output shape
Provenance presence: every response includes a resolvable provenance IRI
Cost envelope accuracy: actual costs within 3x of estimates
Policy coverage: every policy branch exercised at least once
No partial writes: failed invocations leave graph untouched

5.2 Golden-shape assertions

Instead of string-equality assertions that fail on paraphrase:

```python
from trails.testing import assert_shape, assert_provenance_chain

result = capability.invoke("patient.intake", valid_input)

assert_shape(result.payload, Patient)           # SHACL-validated
assert_provenance_chain(
    result.provenance,
    expected_activities=["patient.intake"],
    expected_agents=[test_principal.did],
)
```

5.3 Testing-helper API surface (v1)

Canonical helpers under trails.testing that example code and §5.2 reference:

trails.testing.assert_shape(obj, Shape) — SHACL-pinned assertion (cf. §3.9 of 03-design-spec).
trails.testing.assert_provenance_chain(prov_iri, expected_activities=[...], expected_agents=[...]) — verify PROV-O chain.
trails.testing.fake_act(principal, capabilities, vc=None) — issue a test ACT mandate for use in tests.
trails.testing.fake_commit(author_did, parents=[], diff="...") — construct a test-fixture Commit entity (analogous constructors exist for other shapes).
trails.testing.AuthorizationError — raised when a capability is invoked without a valid ACT/policy.

5.4 Adversarial simulation

Agent is given a "red team" prompt: try to bypass preconditions, exhaust budgets, generate malformed inputs. Framework must:

Reject all bypass attempts via policy / validation
Limit damage to budget caps
Log every attempt in decision log for review

Tooling: trails sim (author-built), Anthropic SDK with cached prompts, golden-shape assertion library.

Cross-cutting test concerns

Coverage targets (summary)

Layer	Unit	Integration	Notes
Rust kernel	≥ 90%	n/a	`cargo tarpaulin`
Python surface	≥ 85%	n/a	`pytest-cov`
E2E flows	n/a	100% of documented capabilities	required before release
Policy decisions	100% branch coverage	via agent-sim	Cedar supports branch tracing

Performance regression tests

Run in CI on every PR:

Capability dispatch overhead ≤ 10 ms p95 (NFR-Perf1)
100-triple write ≤ 20 ms p95 (NFR-Perf2)
tools/list ≤ 50 ms (NFR-Perf4)
Regression threshold: 20% slower fails the build

Tooling: criterion (Rust), pytest-benchmark (Python), CI workflow publishes flamegraphs on regression.
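
The two kinds of threshold above (an absolute p95 budget and a 20% regression limit) could be combined in one gate. This is a sketch only: the function names, the benchmark key, and the sample data are invented, and it assumes the regression is measured against a stored baseline run.

```python
import statistics


def p95(samples_ms):
    """95th percentile of the observed samples (inclusive method)."""
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def regression_failures(baseline_ms, current_ms, budgets_ms, threshold=0.20):
    """Compare p95 per benchmark against its absolute budget and its baseline."""
    failures = []
    for name, samples in current_ms.items():
        now = p95(samples)
        if now > budgets_ms[name]:
            failures.append(f"{name}: p95 {now:.2f} ms over budget {budgets_ms[name]} ms")
        before = p95(baseline_ms[name])
        if now > before * (1 + threshold):
            failures.append(
                f"{name}: p95 {now:.2f} ms is more than {threshold:.0%} slower "
                f"than baseline {before:.2f} ms"
            )
    return failures


# Invented sample data: dispatch slowed from 5 ms to 7 ms but stays under 10 ms.
baseline = {"dispatch": [5.0] * 20}
current = {"dispatch": [7.0] * 20}
budgets = {"dispatch": 10}  # NFR-Perf1: dispatch overhead <= 10 ms p95
failures = regression_failures(baseline, current, budgets)
```

In this example the run stays within the absolute budget yet still fails the build, because 7 ms exceeds the 6 ms allowed by the 20% threshold.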

Security test scenarios

Run in CI weekly:

SPARQL injection: malicious input strings attempting to escape parameterization; framework must sanitize.
Biscuit forgery: tampered tokens; kernel must reject.
DID spoofing: synthetic DID documents; framework must verify against pinned trust roots.
Policy bypass: attempt to invoke capability without PEP check (shouldn't be possible by construction; verified via fuzzer).
Validator bypass: malformed RDF attempting to reach graph unvalidated.

Chaos / fault injection (v1+)

Kill Oxigraph mid-transaction — verify no partial writes
Network partition between Python surface and remote Qlever — verify graceful degradation
OOM simulation — verify cost envelopes limit damage

CI matrix

Required cells (must be green for release; matches NFR-Port1 P0):

Axis	Values
OS / arch	`ubuntu-22.04` (x86_64), `ubuntu-22.04-arm64` (aarch64), `macos-13` (x86_64), `macos-14` (aarch64); `windows-2022` (x86_64) advisory-only
Python	3.11, 3.12, 3.13
Rust	stable, beta, MSRV (fixed)
Graph backend	Oxigraph-embedded, Qlever (testcontainer), Fuseki (testcontainer)

All four Linux and macOS cells are P0: the NFR-Port1 release gate fails if any of the four is red. The windows-2022 cell is P1 (NFR-Port2). CI runs on GitHub Actions. Per PR: unit, integration, and conformance tests on the P0 set. Nightly: agent-sim, security, performance regression, and the full matrix including advisory cells.

Test data management

Fixtures: ontology bundles under tests/fixtures/ontologies/, versioned.
Golden files: PROV subgraphs under tests/golden/, reviewed on change.
Secrets: never in fixtures; test DIDs/biscuits generated per-run.
Determinism: any LLM-touching test is either agent-sim (nondeterministic, shape-pinned) or replays cached responses.

Tool	Band	Purpose
`trails-conformance-graph`	4	Backend conformance suite
`trails-conformance-caps`	4	Capability projection conformance
`trails sim`	5	Agent-based testing
`trails.testing.assert_shape`	5	Shape-pinned assertions
`trails.testing.assert_provenance_chain`	5	Provenance assertions
Cedar policy test harness	4	`.cedar-test` file format + runner
Golden-shape fixture generator	5	Captures example responses per capability

Release gates

Before any tagged release:

Bands 1–3 must pass on all required (P0) CI matrix cells.
Band 4 conformance must pass for all shipped backends.
Band 5 agent-sim must run clean on all example apps for ≥ 100 iterations.
Performance regression budget not exceeded.
Security scenarios re-run manually; findings triaged.
At least one example app redeployed with the new version.

Release gates by milestone

Tight CI min-sets per milestone.

M0 min-set (5 items)

#	Gate	Source
M0-1	`examples/hello.py` boots + one successful MCP `tools/call`	Band 3.2 subset
M0-2	`trails-graph::test_named_graph_isolation` + `test_snapshot_rollback` pass	Band 1.1
M0-3	PyO3 FFI round-trip bench recorded to `bench/m0-baseline.json`	Band 1.3
M0-4	`cargo test` + `pytest examples/` green on `ubuntu-22.04-x86_64` and `macos-14-aarch64`	CI matrix
M0-5	License gate green (`cargo-deny`, `pip-licenses`)	Band 4.7

M1 min-set (9 items)

#	Gate	Source
M1-1	All M0 gates still green	—
M1-2	Band 1.1 + 1.2 coverage targets for `trails-shapes`, `trails-prov`, `trails-caps`, `trails-graph`	Band 1
M1-3	Band 2 §2.1, §2.2, §2.4, §2.5 pass	Band 2
M1-4	Band 3.1 + 3.2 pass; Band 3.3 covers the minimal HTTP adapter subset	Band 3
M1-5	Band 4.1 Oxigraph conformance passing	Band 4
M1-6	Band 4.3 PROV conformance passing	Band 4
M1-7	NFR-Erg1 LLOC gate on `examples/hello.py` (`radon raw` ≤ 10 + MCP probe)	Band 3.7-adjacent
M1-8	NFR-Perf1 + NFR-Perf2 thresholds enforced on `ubuntu-22.04-x86_64`	Performance regression
M1-9	Dogfood service smoke suite reports green Trails invocation	cross-repo CI