06 — Test Plan
Test pyramid (adjusted for agentic-semantic)
               ▲
              / \
             / E \            ← Agent-sim (nondeterministic, golden-shape)
            /─────\
           /   C   \          ← Conformance (cross-backend, cross-runtime)
          /─────────\
         /     I     \        ← Integration (process-level, real backends)
        /─────────────\
       /  Properties   \      ← Property (QuickCheck-style invariants)
      /─────────────────\
     /       Unit        \    ← Unit (fast, pure, per-module)
    /─────────────────────\
Five bands, each with distinct failure modes and tools.
Band 1 — Unit tests
Scope: per-function / per-trait-method. Pure where possible.
1.1 Rust kernel unit tests
| Crate | Coverage target | Example |
|---|---|---|
| `trails-graph` | ≥ 90% | test_named_graph_isolation, test_snapshot_rollback |
| `trails-shapes` | ≥ 95% | test_cardinality_violation, test_nested_shape |
| `trails-reason` | ≥ 80% | test_rdfs_subclass_entailment, test_inferred_cache_invalidation |
| `trails-policy` | ≥ 95% | test_permit_on_valid_vc, test_deny_with_diagnostics |
| `trails-prov` | ≥ 90% | test_prov_activity_chain, test_derivation_link, test_ect_emission_L1, test_ect_emission_L2_signed, test_ect_emission_L3_anchored, test_prov_ect_roundtrip |
| `trails-caps` | ≥ 90% | test_mcp_projection, test_openapi_projection, test_jsonld_canonical |
| `trails-identity` | ≥ 90% | test_did_key_resolve, test_did_web_resolve_tls_pinned, test_act_verify, test_act_replay_rejected, test_act_attenuation_to_biscuit, test_alg_none_rejected |
| `trails-cost` | ≥ 95% | test_budget_exceeded, test_envelope_close_actual |
Tooling: cargo test, cargo tarpaulin for coverage, proptest for property tests (see Band 2).
1.2 Python surface unit tests
| Module | Coverage target | Example |
|---|---|---|
| `decorators.py` | ≥ 90% | test_capability_registration, test_shape_field_mapping |
| `mcp_server.py` | ≥ 80% | test_tools_list, test_tools_call_dispatch |
| `http_adapter.py` | ≥ 80% | test_content_negotiation, test_openapi_generation |
| `render.py` | ≥ 90% | test_bimodal_markdown, test_bimodal_jsonld |
| `cli/*` | ≥ 70% | test_new_scaffold, test_onto_export, test_trails_g_resource, test_trails_dev_autoreload |
| `testing.py` | ≥ 80% | test_assert_shape, test_assert_provenance_chain, test_fake_act |
Tooling: pytest, pytest-cov, hypothesis for property tests.
1.3 FFI boundary tests
Guarantees across the PyO3 edge:
- Panic containment: an injected `panic!` at each kernel entry point (`trails_core.graph`, `trails_core.shapes`, `trails_core.policy`, `trails_core.prov`, `trails_core.caps`, `trails_core.cost`, `trails_core.identity`) surfaces as a Python `KernelError`; the Python process does not abort (NFR-Sec14).
- Structured error preservation: every Rust `TrailsError` variant maps to the documented Python exception type (`ValidationError`, `AuthenticationError`, `AuthorizationError`, `PreconditionError`, `BudgetExceededError`, `HandlerError`, `BackendError`, `KernelError`) with the `.field`, `.constraint`, and `.iri` context fields populated.
- ABI3 wheel: the same binary imports under Python 3.11, 3.12, and 3.13 (covered under Band 4 ABI3 conformance).
Band 2 — Property tests
Scope: invariants over random inputs. Catches edge cases unit tests miss.
2.1 Graph invariants
- Idempotent load: `load(g, d)` then `load(g, d)` yields identical graph state.
- Transaction atomicity: on rollback, no triple from the aborted transaction is visible.
- Named-graph isolation: triples in graph A never appear in queries restricted to graph B.
- SPARQL round-trip: `CONSTRUCT` of a graph, re-INSERTed into an empty store, equals the original (modulo blank node renaming).
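The idempotent-load and named-graph-isolation invariants above can be checked with a plain randomized loop. The sketch below uses only the standard library; `ToyStore` and its `load` and `query` methods are hypothetical stand-ins for a `GraphStore` adapter, not the Trails API, and a real suite would more likely drive the same assertions through hypothesis or proptest.

```python
import random


class ToyStore:
    """Hypothetical in-memory stand-in for a GraphStore (not the Trails API)."""

    def __init__(self):
        self.graphs = {}  # graph name -> set of (s, p, o) triples

    def load(self, graph, triples):
        # Set union makes loading the same data twice a no-op.
        self.graphs.setdefault(graph, set()).update(triples)

    def query(self, graph):
        # A query restricted to one named graph sees only that graph.
        return set(self.graphs.get(graph, set()))


def random_triples(rng, n):
    """Generate up to n random triples from a small vocabulary."""
    return {
        (f"s{rng.randrange(20)}", f"p{rng.randrange(5)}", f"o{rng.randrange(20)}")
        for _ in range(n)
    }


def check_graph_invariants(trials=200, seed=0):
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        store = ToyStore()
        a, b = random_triples(rng, 10), random_triples(rng, 10)
        store.load("A", a)
        once = store.query("A")
        store.load("A", a)
        assert store.query("A") == once  # idempotent load
        store.load("B", b)
        assert store.query("A") == a     # isolation: writes to B are invisible in A
        assert store.query("B") == b
    return True
```

Seeding the generator is a deliberate choice here: a failing trial can be replayed exactly, which is the property-test equivalent of a saved counterexample.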
2.2 Shape invariants
- Monotonicity: adding a more-permissive shape cannot cause previously-valid data to fail.
- Composition: `validate(data, [A, B])` ≡ `validate(data, A) ∧ validate(data, B)`.
- Cardinality: any data satisfying `min=n, max=m` triggers no cardinality violation; data with `n-1` or `m+1` occurrences triggers a violation with the correct diagnostic.
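The cardinality and composition invariants can be exercised exhaustively for small bounds. This is a minimal sketch under the assumption that a shape reduces to a `(min, max)` pair; `cardinality_violations` and `validate` are illustrative names, not the `trails-shapes` API, and the diagnostic strings are invented for the example.

```python
def cardinality_violations(values, min_count, max_count):
    """Return a list of diagnostics; an empty list means the data conforms."""
    n = len(values)
    if n < min_count:
        return [f"minCount violated: expected >= {min_count}, got {n}"]
    if n > max_count:
        return [f"maxCount violated: expected <= {max_count}, got {n}"]
    return []


def validate(values, shapes):
    """Validate against several (min, max) shapes and collect every violation."""
    out = []
    for lo, hi in shapes:
        out.extend(cardinality_violations(values, lo, hi))
    return out


def check_shape_invariants(max_n=6):
    for lo in range(max_n):
        for hi in range(lo, max_n):
            # Anything between min and max is clean.
            for n in range(lo, hi + 1):
                assert cardinality_violations(["v"] * n, lo, hi) == []
            # One below min and one above max each produce the matching diagnostic.
            if lo > 0:
                assert "minCount" in cardinality_violations(["v"] * (lo - 1), lo, hi)[0]
            assert "maxCount" in cardinality_violations(["v"] * (hi + 1), lo, hi)[0]
    # Composition: valid under [A, B] exactly when valid under A and under B.
    a, b = (1, 3), (2, 5)
    for n in range(8):
        data = ["v"] * n
        both = validate(data, [a, b]) == []
        each = validate(data, [a]) == [] and validate(data, [b]) == []
        assert both == each
    return True
```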
2.3 Policy invariants
- Determinism: same request → same decision, always.
- Explicit-deny wins: presence of an explicit `forbid` always overrides any `permit`.
- Context-freeness: decisions depend only on declared inputs (principal, action, resource, context), with no hidden state.
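The determinism and explicit-deny invariants can be illustrated with a toy evaluator. The sketch below assumes Cedar-like semantics (any matching `forbid` wins, and nothing matching means deny); `decide` and the policy tuples are hypothetical and do not call the Cedar engine.

```python
from itertools import permutations


def decide(policies, request):
    """Hypothetical evaluator: policies are (effect, predicate) pairs."""
    matched = [effect for effect, pred in policies if pred(request)]
    if "forbid" in matched:
        return "deny"   # an explicit forbid overrides any permit
    if "permit" in matched:
        return "allow"
    return "deny"       # default deny when nothing matches


def check_policy_invariants():
    policies = [
        ("permit", lambda r: r["action"] == "read"),
        ("permit", lambda r: r["principal"] == "did:example:alice"),
        ("forbid", lambda r: r["resource"] == "secret"),
    ]
    requests = [
        {"principal": "did:example:alice", "action": "read", "resource": "secret"},
        {"principal": "did:example:alice", "action": "write", "resource": "notes"},
        {"principal": "did:example:bob", "action": "write", "resource": "notes"},
    ]
    expected = ["deny", "allow", "deny"]
    for order in permutations(policies):
        for req, want in zip(requests, expected):
            # Determinism: neither policy order nor repeated evaluation changes the result.
            assert decide(list(order), req) == want
            assert decide(list(order), req) == want
    return True
```

Permuting the policy list is a cheap way to catch an evaluator that accidentally depends on declaration order, which would break both determinism and the deny-overrides rule.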
2.4 Provenance invariants
- Conservation: every `generated` entity is traceable back through `used` entities to original inputs (or declared external sources).
- Monotonicity: PROV triples are append-only; no existing activity record is ever mutated.
- Signed integrity (when enabled): tampered PROV records fail signature verification.
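The conservation invariant amounts to a reachability check over the generated/used relations. The following sketch assumes the PROV graph has already been flattened into two dictionaries; `untraceable` and its arguments are hypothetical, and treating an activity with no inputs as untraceable is an assumption of this example rather than a documented Trails rule.

```python
def untraceable(entities, generated_by, used, sources):
    """Return the entities that cannot be traced back to declared sources.

    generated_by maps an entity to the activity that produced it;
    used maps an activity to the list of entities it consumed.
    """

    def traceable(entity, seen):
        if entity in sources:
            return True
        # A cycle or an entity with no generating activity breaks the chain.
        if entity in seen or entity not in generated_by:
            return False
        inputs = used.get(generated_by[entity], [])
        # Assumption: an activity with no recorded inputs does not count as traced.
        return bool(inputs) and all(traceable(i, seen | {entity}) for i in inputs)

    return sorted(e for e in entities if not traceable(e, frozenset()))


# Example: "report" derives from "clean", which derives from the uploaded file.
example_sources = {"upload.csv"}
example_generated_by = {"clean": "a1", "report": "a2", "orphan": "a3"}
example_used = {"a1": ["upload.csv"], "a2": ["clean"], "a3": []}
```

Calling `untraceable(["clean", "report", "orphan"], example_generated_by, example_used, example_sources)` reports only `orphan`, because its activity records no inputs.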
2.5 Capability invariants
- Manifest consistency: MCP projection, JSON-LD canonical, and OpenAPI projection all describe the same input/output shape.
- Version coherence: `capability.version` is SemVer; `deprecates` always references a valid prior version.
- Idempotency: capabilities declared `idempotent=True` produce identical observable state on repeated invocation with the same idempotency key.
Band 3 — Integration tests
Scope: multiple subsystems, real backends (no mocks). Process-level.
3.1 End-to-end capability invocation
- Python `@capability` → PyO3 → Rust kernel → Oxigraph → response envelope
- Validation failure at input → correct 400 + ValidationReport with field/constraint/value
- Policy deny → correct 403 + decision log entry
- Handler exception → correct 500 + trace ID; no partial writes visible
- Budget exceeded → correct 429 + BudgetStatus
- Successful invocation → response envelope with payload, provenance IRI, cost, consent receipt, trace ID
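The status codes listed above suggest a single error-to-status table that an integration test can assert against. The sketch below is illustrative only: the exception classes are local stand-ins that borrow the names from section 1.3, `invoke` is a hypothetical wrapper, and the 200 status for success and the use of `AuthorizationError` for a policy deny are assumptions.

```python
import uuid


# Local stand-ins named after the documented exception types (not imported from Trails).
class ValidationError(Exception):
    pass


class AuthorizationError(Exception):
    pass


class BudgetExceededError(Exception):
    pass


class HandlerError(Exception):
    pass


STATUS_BY_ERROR = {
    ValidationError: 400,      # validation failure at input
    AuthorizationError: 403,   # policy deny
    BudgetExceededError: 429,  # budget exceeded
    HandlerError: 500,         # handler exception
}


def invoke(handler, payload):
    """Run a handler and return (status, body); every body carries a trace ID."""
    trace_id = uuid.uuid4().hex
    try:
        return 200, {"payload": handler(payload), "trace_id": trace_id}
    except Exception as exc:
        # Unrecognised failures are treated like handler errors.
        status = STATUS_BY_ERROR.get(type(exc), 500)
        body = {"error": type(exc).__name__, "detail": str(exc), "trace_id": trace_id}
        return status, body
```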
3.2 MCP integration
- Fresh Trails app + bundled MCP SSE server
- External MCP client (reference: Claude Code CLI in test mode) connects
- `tools/list` returns correct tool schemas
- `tools/call` invokes capabilities with correct argument binding
- Error mapping (MCP errors ↔ framework errors) correct
3.3 HTTP integration
- FastAPI mount serves OpenAPI at `/openapi.json`
- Content negotiation: `Accept: text/markdown` → Markdown; `application/ld+json` → JSON-LD; `application/json` → plain JSON
- Capability manifest at `/.well-known/capabilities` is valid JSON-LD with correct context
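The content-negotiation rule can be pinned down with a small selector that a test can call directly. This is a simplified sketch, not the `http_adapter.py` implementation: `negotiate` is a hypothetical helper, it honours `q` values and `*/*` but ignores partial wildcards such as `text/*`, and falling back to `application/json` for `*/*` is an assumption.

```python
SUPPORTED = ("text/markdown", "application/ld+json", "application/json")


def negotiate(accept, default="application/json"):
    """Pick the best supported media type for an Accept header value."""
    candidates = []
    for position, part in enumerate(accept.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media, q = fields[0].lower(), 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # an unparsable weight is treated as "not acceptable"
        if q > 0:
            # Sort key: highest q first, then original header order.
            candidates.append((-q, position, media))
    for _, _, media in sorted(candidates):
        if media in SUPPORTED:
            return media
        if media == "*/*":
            return default
    return None  # the caller would answer 406 Not Acceptable
```

Returning `None` rather than raising keeps the helper easy to assert on; the HTTP layer decides how to turn that into a response.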
3.4 Graph backend integration
- Oxigraph embedded: CRUD, named graphs, transactions
- Qlever remote: CRUD (write via SPARQL UPDATE), large-result streaming
- Fuseki: CRUD, federated query
- Switching backend in `trails.toml` produces identical application behavior for a fixed test suite (the "backend conformance suite"; see Band 4)
3.5 Identity + policy integration
- Anonymous request → policy deny (unless explicitly permitted)
- DID-signed biscuit → principal resolved, policy evaluated with correct principal context
- VC-gated precondition → VC verified, preconditions pass/fail correctly
- Biscuit attenuation: parent token authorizes subset; attenuated child cannot exceed
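The attenuation property in the last bullet can be modelled without the Biscuit library. The sketch below is a toy model, not Biscuit's Datalog semantics: a token is a tuple of restriction blocks, `attenuate` and `authorizes` are hypothetical helpers, and a capability is granted only if every block allows it, so an appended block can narrow rights but never widen them.

```python
def attenuate(token, allowed):
    """Append a restriction block; existing blocks are never modified."""
    return token + (frozenset(allowed),)


def authorizes(token, capability):
    """A capability is granted only when every block in the chain allows it."""
    return all(capability in block for block in token)


# The parent token may read and write.
parent = (frozenset({"read", "write"}),)

# The child asks for "read" plus "admin", which the parent never held.
child = attenuate(parent, {"read", "admin"})
```

Here `authorizes(child, "read")` holds, while `write` (dropped by the child block) and `admin` (never granted by the parent) are both refused.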
3.6 Cost + budget integration
- Per-capability envelope opened/closed
- Budget-exhausted request rejected with 429 before handler runs
- Anomaly detection fires when p99 latency exceeds configured threshold multiplier
3.7 Cross-subsystem integration (DispatchCoordinator)
- `DispatchCoordinator` rolls back all graph writes when output validation fails after a successful handler body (NFR-Rel1).
- Provenance is emitted on every deny branch (`trails:outcome "denied" | "validation_failed" | "budget_exceeded" | "handler_error"`), not only on success; this verifies the always-on posture of ADR-0009 under policy-deny paths.
- Cost envelope is closed with `actual=estimate` on any abort; no envelope leaks across a failed dispatch.
- TOCTOU: a concurrent graph write mid-dispatch does not change the view seen by policy or the handler (snapshot isolation; NFR-Sec10).
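The rollback, provenance, and envelope rules above interact, so a compact model helps when writing the assertions. The sketch below is a hypothetical `dispatch` function, not the `DispatchCoordinator` implementation: the graph is a plain set of triples, a deep copy stands in for snapshot isolation, and the handler is assumed to return a `(payload, actual_cost)` pair.

```python
import copy


def dispatch(graph, handler, validate_output, estimate, prov_log):
    """Stage writes on a private copy and commit them only on full success."""
    staged = copy.deepcopy(graph)  # the handler never touches the live graph
    outcome, actual = "success", estimate
    try:
        payload, actual = handler(staged)
        if not validate_output(payload):
            outcome = "validation_failed"
    except Exception:
        outcome = "handler_error"
    if outcome == "success":
        graph.clear()
        graph.update(staged)       # commit: staged writes become visible
    else:
        actual = estimate          # abort: close the envelope with actual=estimate
    prov_log.append({"outcome": outcome})  # provenance is recorded on every branch
    return outcome, {"estimate": estimate, "actual": actual}
```

A test can then run one handler per branch and assert three things after each call: the live graph is unchanged unless the outcome is `success`, the returned envelope is closed, and `prov_log` gained exactly one record.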
3.8 Graceful shutdown
- On `SIGTERM`, in-flight handlers drain within the configured window (default 30 s); the process exits cleanly. New invocations started after `SIGTERM` are rejected with 503 (NFR-Rel4).
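The drain behaviour can be tested in-process before involving real signals. The sketch below is a minimal model using `threading.Condition`; `Drainer` and its methods are hypothetical names, and wiring `shutdown` to an actual `SIGTERM` handler is left out.

```python
import threading


class Drainer:
    """Tracks in-flight handlers and refuses new ones once shutdown begins."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._closing = False

    def try_begin(self):
        with self._cond:
            if self._closing:
                return False  # the caller would answer 503
            self._in_flight += 1
            return True

    def end(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, window_seconds=30.0):
        """Stop admitting work; return True if everything drained in time."""
        with self._cond:
            self._closing = True
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=window_seconds
            )
```

A test starts one handler, schedules its `end()` on a timer, and asserts that `shutdown` returns `True` and that a later `try_begin()` is refused.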
Tooling: pytest with testcontainers-python for Qlever/Fuseki/Postgres; cargo test with #[tokio::test] for Rust integration.
Band 4 — Conformance tests
Scope: "does this adapter / implementation meet the Trails spec?"
4.1 Graph backend conformance suite
A Python test suite (trails-conformance-graph) any GraphStore implementation must pass:
- 50 SPARQL queries covering SELECT, CONSTRUCT, ASK, DESCRIBE, federated SERVICE
- Named-graph write/read/delete
- Transaction commit/rollback
- Large-result streaming (≥ 100k rows)
- Unicode + datatype edge cases (langString, xsd:dateTimeStamp, blank nodes)
All shipped adapters (Oxigraph, Qlever, Fuseki) must pass. Third-party adapters earn the "conformant" badge by passing the same suite.
4.2 Capability projection conformance
Given a fixed capability descriptor:
- MCP projection passes MCP schema validator
- OpenAPI projection passes OpenAPI 3.1 validator
- JSON-LD canonical form passes JSON-LD 1.1 processor round-trip
4.3 Provenance conformance
Generated PROV triples round-trip through:
- Apache Jena's PROV validator
- pyprov / provpy round-trip
- SPARQL queries matching standard PROV-O patterns produce expected results
4.4 Policy engine conformance
Given a reference Cedar policy set, decisions match the Cedar reference implementation byte-for-byte, including diagnostic output.
4.5 ACT / ECT conformance
Normative behaviour per draft-nennemann-act-00 (ACT) and draft-nennemann-wimse-ect (ECT):
- Tokens produced by `trails-identity` validate under the published ACT / ECT draft test vectors (JSON samples committed under `tests/conformance/act-ect/`).
- Replay protection: a previously-seen `jti` (within TTL) is rejected; the rejection path is logged and counted (NFR-Sec5).
- Algorithm downgrade: tokens with `alg=none`, HS*, or any algorithm outside `{EdDSA, ES256, ES384}` are rejected at parse (NFR-Sec6).
- L1/L2/L3 emission: `@capability(assurance=...)` drives ECT assurance level; L1 is unsigned JSON, L2 is JOSE-signed, L3 is JOSE-signed + external-ledger anchored. Round-trip via `pyprov` and a reference ECT verifier.
- L1 export containment (NFR-Sec13): an L1 ECT export to an external sink fails unless the operator opt-in flag is set.
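The replay and algorithm-downgrade checks lend themselves to a small, clock-injectable model. The sketch below is an illustration, not the `trails-identity` code: `ReplayCache` and `admit` are hypothetical, the header and claims are assumed to be already decoded, and signature verification is deliberately omitted.

```python
import time

ALLOWED_ALGS = frozenset({"EdDSA", "ES256", "ES384"})


class ReplayCache:
    """Remembers seen jti values until their TTL lapses."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self.seen = {}      # jti -> expiry timestamp

    def check_and_store(self, jti):
        now = self.clock()
        # Drop expired entries so the cache does not grow without bound.
        self.seen = {k: exp for k, exp in self.seen.items() if exp > now}
        if jti in self.seen:
            return False    # replay within the TTL
        self.seen[jti] = now + self.ttl
        return True


def admit(header, claims, cache):
    """Return (accepted, reason) for an already-decoded header and claim set."""
    if header.get("alg") not in ALLOWED_ALGS:
        return False, "alg_rejected"  # covers alg=none, HS*, and anything unlisted
    jti = claims.get("jti")
    if not jti:
        return False, "jti_missing"
    if not cache.check_and_store(jti):
        return False, "replay_rejected"
    return True, "ok"
```

Checking the algorithm before touching the cache means a downgraded token never consumes a `jti` slot, so it cannot be used to block a legitimate token with the same identifier.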
4.6 ABI3 wheel conformance
The compiled Rust wheel imports under Python 3.11, 3.12, 3.13 with the same binary; trails_core smoke tests pass on each.
4.7 License gate
`cargo deny check licenses` and `pip-licenses --fail-on GPL` must both pass. Required for NFR-Lic1.
Band 5 — Agent-simulation tests
Scope: nondeterministic but shape-pinned. The only band that uses LLMs.
5.1 trails sim
Local agent (cheap model, Haiku or equivalent) configured with:
- The app's MCP tool list
- Random plausible inputs generated from shape schemas
- Budget cap per simulation run (default $0.50)
Invokes capabilities 100+ times, checking:
- Shape pinning: every response matches declared output shape
- Provenance presence: every response includes a resolvable provenance IRI
- Cost envelope accuracy: actual costs within 3x of estimates
- Policy coverage: every policy branch exercised at least once
- No partial writes: failed invocations leave graph untouched
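Because the agent's inputs are random, the checks above are best expressed as a summary over run records rather than per-call expectations. The sketch below assumes a hypothetical record format (`shape_ok`, `provenance_iri`, `estimate`, `actual`, `policy_branch`) that `trails sim` may or may not use, and it reads "within 3x" as a two-sided bound, which is an interpretation rather than something the plan states.

```python
def sim_failures(runs, policy_branches, factor=3.0):
    """Summarise a simulation run as a list of (run_index, problem) pairs."""
    failures = []
    exercised = set()
    for i, run in enumerate(runs):
        exercised.add(run["policy_branch"])
        if not run["shape_ok"]:
            failures.append((i, "shape"))
        if not run.get("provenance_iri"):
            failures.append((i, "provenance"))
        est, act = run["estimate"], run["actual"]
        # Two-sided bound: neither more than factor-x over nor under the estimate.
        if not (est / factor <= act <= est * factor):
            failures.append((i, "cost"))
    # Policy coverage is a property of the whole run, so it has no run index.
    for branch in sorted(set(policy_branches) - exercised):
        failures.append((None, f"uncovered:{branch}"))
    return failures
```

An empty list means the run was clean; anything else names the run index and the check that failed, which keeps nondeterministic failures easy to triage.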
5.2 Golden-shape assertions
Instead of string-equality assertions that fail on paraphrase:
    from trails.testing import assert_shape, assert_provenance_chain

    # capability, valid_input, Patient, and test_principal come from the
    # app under test and its fixtures.
    result = capability.invoke("patient.intake", valid_input)
    assert_shape(result.payload, Patient)  # SHACL-validated
    assert_provenance_chain(
        result.provenance,
        expected_activities=["patient.intake"],
        expected_agents=[test_principal.did],
    )
5.3 Testing-helper API surface (v1)
Canonical helpers under trails.testing that example code and §5.2 reference:
- `trails.testing.assert_shape(obj, Shape)`: SHACL-pinned assertion (cf. §3.9 of 03-design-spec).
- `trails.testing.assert_provenance_chain(prov_iri, expected_activities=[...], expected_agents=[...])`: verify PROV-O chain.
- `trails.testing.fake_act(principal, capabilities, vc=None)`: issue a test ACT mandate for use in tests.
- `trails.testing.fake_commit(author_did, parents=[], diff="...")`: construct a test-fixture Commit entity (analogous constructors exist for other shapes).
- `trails.testing.AuthorizationError`: raised when a capability is invoked without a valid ACT/policy.
5.4 Adversarial simulation
The agent is given a "red team" prompt: try to bypass preconditions, exhaust budgets, generate malformed inputs. The framework must:
- Reject all bypass attempts via policy / validation
- Limit damage to budget caps
- Log every attempt in decision log for review
Tooling: trails sim (author-built), Anthropic SDK with cached prompts, golden-shape assertion library.
Cross-cutting test concerns
Coverage targets (summary)
| Layer | Unit | Integration | Notes |
|---|---|---|---|
| Rust kernel | ≥ 90% | n/a | cargo tarpaulin |
| Python surface | ≥ 85% | n/a | pytest-cov |
| E2E flows | n/a | 100% of documented capabilities | required before release |
| Policy decisions | 100% branch coverage | via agent-sim | Cedar supports branch tracing |
Performance regression tests
Run in CI on every PR:
- Capability dispatch overhead ≤ 10 ms p95 (NFR-Perf1)
- 100-triple write ≤ 20 ms p95 (NFR-Perf2)
- `tools/list` ≤ 50 ms (NFR-Perf4)
- Regression threshold: 20% slower fails the build
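The gate combines two independent checks: an absolute budget per metric and a relative slowdown against a stored baseline. The sketch below shows one way to evaluate both; the metric names and `perf_failures` are hypothetical, and "20% slower" is read as strictly more than 20% above baseline, which is an assumption.

```python
# Absolute budgets in milliseconds, taken from the list above (metric names are invented).
BUDGETS_MS = {
    "dispatch_overhead_p95": 10.0,   # NFR-Perf1
    "write_100_triples_p95": 20.0,   # NFR-Perf2
    "tools_list": 50.0,              # NFR-Perf4
}


def perf_failures(baseline, current, budgets=BUDGETS_MS, max_regression=0.20):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, value in current.items():
        if name in budgets and value > budgets[name]:
            failures.append(f"{name}: {value} ms exceeds budget {budgets[name]} ms")
        base = baseline.get(name)
        if base and value > base * (1 + max_regression):
            failures.append(
                f"{name}: {value} ms is more than {max_regression:.0%} "
                f"slower than baseline {base} ms"
            )
    return failures
```

Keeping the two checks separate matters: a metric can stay inside its absolute budget and still fail the build because it regressed sharply from a fast baseline.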
Tooling: criterion (Rust), pytest-benchmark (Python), CI workflow publishes flamegraphs on regression.
Security test scenarios
Run in CI weekly:
- SPARQL injection: malicious input strings attempting to escape parameterization; framework must sanitize.
- Biscuit forgery: tampered tokens; kernel must reject.
- DID spoofing: synthetic DID documents; framework must verify against pinned trust roots.
- Policy bypass: attempt to invoke capability without PEP check (shouldn't be possible by construction; verified via fuzzer).
- Validator bypass: malformed RDF attempting to reach graph unvalidated.
Chaos / fault injection (v1+)
- Kill Oxigraph mid-transaction — verify no partial writes
- Network partition between Python surface and remote Qlever — verify graceful degradation
- OOM simulation — verify cost envelopes limit damage
CI matrix
Required cells (must be green for release; matches NFR-Port1 P0):
| Axis | Values |
|---|---|
| OS / arch | ubuntu-22.04 (x86_64), ubuntu-22.04-arm64 (aarch64), macos-13 (x86_64), macos-14 (aarch64); windows-2022 (x86_64) advisory-only |
| Python | 3.11, 3.12, 3.13 |
| Rust | stable, beta, MSRV (fixed) |
| Graph backend | Oxigraph-embedded, Qlever (testcontainer), Fuseki (testcontainer) |
All four Linux + macOS arch cells are P0 — NFR-Port1 release gate fails if any of the four is red. windows-2022 is P1 (NFR-Port2). GitHub Actions; per-PR: unit + integration + conformance on the P0 set. Nightly: agent-sim, security, performance regression, plus the full matrix including advisory cells.
Test data management
- Fixtures: ontology bundles under `tests/fixtures/ontologies/`, versioned.
- Golden files: PROV subgraphs under `tests/golden/`, reviewed on change.
- Secrets: never in fixtures; test DIDs/biscuits generated per-run.
- Determinism: any LLM-touching test is either agent-sim (nondeterministic, shape-pinned) or replays cached responses.
Test-related tooling to build
| Tool | Band | Purpose |
|---|---|---|
| `trails-conformance-graph` | 4 | Backend conformance suite |
| `trails-conformance-caps` | 4 | Capability projection conformance |
| `trails sim` | 5 | Agent-based testing |
| `trails.testing.assert_shape` | 5 | Shape-pinned assertions |
| `trails.testing.assert_provenance_chain` | 5 | Provenance assertions |
| Cedar policy test harness | 4 | .cedar-test file format + runner |
| Golden-shape fixture generator | 5 | Captures example responses per capability |
Release gates
Before any tagged release:
- Bands 1–3 must pass on all CI matrix cells.
- Band 4 conformance must pass for all shipped backends.
- Band 5 agent-sim must run clean on all example apps for ≥ 100 iterations.
- Performance regression budget not exceeded.
- Security scenarios re-run manually; findings triaged.
- At least one example app redeployed with the new version.
Release gates by milestone
Tight CI min-sets per milestone.
M0 min-set (5 items)
| # | Gate | Source |
|---|---|---|
| M0-1 | examples/hello.py boots + one successful MCP tools/call | Band 3.2 subset |
| M0-2 | trails-graph::test_named_graph_isolation + test_snapshot_rollback pass | Band 1.1 |
| M0-3 | PyO3 FFI round-trip bench recorded to bench/m0-baseline.json | Band 1.3 |
| M0-4 | cargo test + pytest examples/ green on ubuntu-22.04-x86_64 and macos-14-aarch64 | CI matrix |
| M0-5 | License-header gate green (cargo-deny, pip-licenses) | Band 4.7 |
M1 min-set (9 items)
| # | Gate | Source |
|---|---|---|
| M1-1 | All M0 gates still green | — |
| M1-2 | Band 1.1 + 1.2 coverage targets for trails-shapes, trails-prov, trails-caps, trails-graph | Band 1 |
| M1-3 | Band 2 §2.1, §2.2, §2.4, §2.5 pass | Band 2 |
| M1-4 | Band 3.1 + 3.2 pass; Band 3.3 covers the minimal HTTP adapter subset | Band 3 |
| M1-5 | Band 4.1 Oxigraph conformance passing | Band 4 |
| M1-6 | Band 4.3 PROV conformance passing | Band 4 |
| M1-7 | NFR-Erg1 LLOC gate on examples/hello.py (radon raw ≤ 10 + MCP probe) | Band 3.7-adjacent |
| M1-8 | NFR-Perf1 + NFR-Perf2 thresholds enforced on ubuntu-22.04-x86_64 | Performance regression |
| M1-9 | Dogfood service smoke suite reports green Trails invocation | cross-repo CI |