ADR-0075: MCP Security Hardening (OWASP MCP Top 10)¶

Status: Accepted
Date: 2026-05-25
Extends: ADR-0008 (MCP primary transport), ADR-0006 (Cedar policy engine), ADR-0052 (Memory security)
Tracks: trails.mcp_security

Context¶

Trails' MCP relay (ADR-0008), SPARQL endpoint (ADR-0023), and Cedar trust stack (ADR-0006) form a distributed tool-execution surface exposed to both human users and autonomous LLM agents. This surface has grown significantly across M3–M29 and now constitutes a production-grade attack vector.

Empirical threat evidence:

MCPTox (Hou et al., AAAI 2026) benchmarked 10 real-world MCP servers against 9 categories of tool-poisoning attacks. o1-mini achieved a 72.8% attack success rate; Claude-3.7-Sonnet refused fewer than 3% of poisoned prompts. Safety alignment alone — model-level instruction-following — is insufficient. Defense must be structural.
PoisonedRAG (Zou et al., USENIX Security 2025) demonstrated that injecting as few as 5 adversarial documents into a retrieval corpus yields a 90% attack success rate on downstream LLM outputs. Trails' trails.vector retrieval path is directly in scope.
The OWASP Top 10 for LLM Applications (2025) codified 10 threat categories for MCP-enabled systems. An audit of Trails' current implementation finds exposure across at least 7 of the 10 categories.

Current gaps:

OWASP MCP threat	Trails exposure
MCP-TOP10-01: Prompt Injection	Tool descriptions not sanitised; MCP relay accepts arbitrary tool descriptions from registered servers
MCP-TOP10-02: Insecure Data Handling	SPARQL query results passed unsanitised to LLM context
MCP-TOP10-03: Tool Definition Tampering (rug pull)	No snapshot of tool definitions at registration; no drift detection
MCP-TOP10-04: Malicious Tool Execution	No per-tool scope allowlist beyond Cedar policies
MCP-TOP10-05: Information Leakage	Tool error messages may leak internal graph structure
MCP-TOP10-06: Excessive Scopes	Cedar policies applied at capability level, not individual tool level
MCP-TOP10-07: Tool Shadowing	No rejection of registrations that shadow existing tool names
MCP-TOP10-08: Indirect Prompt Injection via Resource URIs	Retrieved content from `trails://` resources not sanitised
MCP-TOP10-09: Insufficient Logging	Tool call audit log incomplete; no tamper-evident log
MCP-TOP10-10: Sampling Abuse	No rate limiting per principal on MCP sampling requests

Decision¶

Introduce trails/mcp_security.py implementing a McpSecurityGuard class and MCPThreatModel dataclass. This module is instantiated by the MCP relay at startup and sits in the request/response path for all tool calls.

`MCPThreatModel`¶

Documents all 10 OWASP MCP threats with Trails-specific severity assessments:

@dataclass
class MCPThreat:
    id: str              # e.g. "MCP-TOP10-01"
    name: str
    description: str
    trails_severity: str # "critical" | "high" | "medium" | "low"
    trails_exposure: str # how Trails is specifically exposed
    mitigation_ref: str  # name of the McpSecurityGuard method that addresses it

@dataclass
class MCPThreatModel:
    threats: list[MCPThreat]
    assessed_date: str
    assessor: str

    def to_report(self) -> str:
        """Render a human-readable threat model report."""

    def threats_by_severity(self, severity: str) -> list[MCPThreat]:
        """Filter threats by severity level."""

The MCPThreatModel is exportable as JSON and surfaced via trails security threat-model CLI.

`McpSecurityGuard`¶

class McpSecurityGuard:
    """
    Structural security layer for the MCP relay.

    Instantiated once at MCP server startup. Inspects all tool registrations
    and all inbound tool call requests before dispatch.

    Parameters
    ----------
    cedar_policies : CedarPolicySet
        Loaded Cedar policies for scope-check decisions.
    rate_limit_config : RateLimitConfig
        Per-principal call budgets (calls/minute, calls/hour).
    audit_log : AuditLog
        Append-only audit log sink (file, structured JSON, or OTLP).
    snapshot_dir : str | None
        Directory for tool-definition snapshots (rug-pull detection).
        None disables rug-pull detection.
    """

    # --- Tool Registration ---

    def validate_tool_registration(self, tool_def: ToolDefinition) -> ValidationResult:
        """
        Called when an MCP server registers a tool.

        Checks:
        1. Tool-poisoning scan of tool description (MCP-TOP10-01)
        2. Shadow detection — reject if name conflicts with existing tool (MCP-TOP10-07)
        3. Snapshot the tool definition for rug-pull baseline (MCP-TOP10-03)
        4. Validate scope claim against Cedar tool-level allowlist (MCP-TOP10-06)
        """

    def check_rug_pull(self, tool_name: str, current_def: ToolDefinition) -> RugPullResult:
        """
        Compare current tool definition against snapshotted baseline.

        Returns RugPullResult.CLEAN if unchanged, RugPullResult.DRIFT if
        name/description/parameters differ, RugPullResult.MISSING if no
        baseline exists (first registration).
        (MCP-TOP10-03)
        """

    # --- Request Handling ---

    def check_request(
        self,
        tool_name: str,
        arguments: dict,
        principal: str,
        *,
        context: str | None = None,
    ) -> CheckResult:
        """
        Called before every tool invocation.

        Checks (in order):
        1. Rate limit for principal (MCP-TOP10-10)
        2. Cedar scope check — principal allowed to call this tool (MCP-TOP10-06)
        3. StruQ-style input sanitisation of argument values (MCP-TOP10-01, -02)
        4. Rug-pull check — tool definition unchanged since registration (MCP-TOP10-03)
        5. Audit log entry written (MCP-TOP10-09)
        """

    def sanitise_input(self, value: str) -> str:
        """
        StruQ-style structured input sanitisation.

        Wraps data content in [DATA]...[/DATA] envelope to structurally
        separate it from instruction context. Strips invisible-Unicode
        characters (zero-width spaces, direction overrides, homoglyphs
        in ASCII range). Flags — but does not strip — potential injection
        keywords for caller review.
        (MCP-TOP10-01, USENIX Security 2025)
        """

    def sanitise_retrieval_output(self, passages: list[str]) -> list[str]:
        """
        Sanitise LLM-context passages before injection into agent prompts.

        Detects PoisonedRAG-style adversarial patterns: passages that contain
        instruction-like directives, role-override attempts, or system-prompt
        markers. Flagged passages are wrapped in [UNTRUSTED_CONTENT]...[/UNTRUSTED_CONTENT]
        envelope rather than stripped, so the LLM receives both content and
        trust signal.
        (MCP-TOP10-08, PoisonedRAG USENIX Security 2025)
        """

    # --- Tool-poisoning detection ---

    def scan_tool_description(self, description: str) -> PoisonScanResult:
        """
        Regex + heuristic scan of a tool description for injection patterns.

        Detects:
        - Invisible Unicode: zero-width joiners, directional overrides, homoglyphs
        - System-prompt override attempts: "ignore previous instructions", "you are now"
        - TOOL_DESC injection: embedded instructions mimicking system context
        - Credential/key extraction probes: "print your API key", "send headers to"

        Returns PoisonScanResult with severity, matched patterns, and sanitised description.
        (MCPTox AAAI 2026, MCP-TOP10-01)
        """

    # --- Rate limiting ---

    def check_rate_limit(self, principal: str, tool_name: str) -> RateLimitResult:
        """
        Sliding-window rate limit per principal.

        Configurable: calls per minute, calls per hour, burst allowance.
        Counts are in-memory by default; Redis backend available for
        multi-process deployments.
        (MCP-TOP10-10)
        """

    # --- Audit logging ---

    def log_tool_call(
        self,
        tool_name: str,
        principal: str,
        arguments_hash: str,  # SHA-256 of serialised args — never logs raw args
        result_status: str,
        security_events: list[str],
    ) -> None:
        """
        Write structured audit log entry.

        Format: JSON-L with timestamp, tool_name, principal, args_hash,
        result_status, security_events list, guard_version.

        Never logs raw argument values — only SHA-256 hash. Sensitive
        field filtering is caller-configurable.
        (MCP-TOP10-09, IsolateGPT NDSS 2025)
        """

    # --- Scope checking ---

    def check_tool_scope(self, tool_name: str, principal: str) -> ScopeCheckResult:
        """
        Cedar policy-gated tool-level capability allowlist.

        Each registered tool declares a minimum scope claim. Cedar evaluates:
        PERMIT if principal.scope contains tool.required_scope.
        The Cedar policy is evaluated against the tool-level scope,
        independent of the higher-level capability policy — per-tool
        granularity.
        (MCP-TOP10-06, Cedar ADR-0006)
        """

Tool-shadow detection¶

When a new tool registration arrives, validate_tool_registration() checks the registry for any existing tool with the same name (case-insensitive and Unicode-normalised to NFKC). If a match is found and the registering server differs from the original registrant, the registration is rejected with a ToolShadowingError. Legitimate tool version updates from the same server are allowed but require an explicit force_update=True flag.

Rug-pull detection flow¶

At registration, McpSecurityGuard serialises the full ToolDefinition (name, description, parameters schema) to JSON and writes a SHA-256 snapshot to snapshot_dir/<tool_name>.snapshot.json. On each subsequent call, check_rug_pull() recomputes the hash and compares it to the snapshot. Any drift triggers a RugPullWarning in the audit log and — depending on rug_pull_policy — either blocks the call (policy="strict") or logs and allows it (policy="warn").

StruQ-style input sanitisation¶

The CaMeL (Debenedetti et al., 2025) and StruQ (Chen et al., USENIX Security 2025) approaches both demonstrate that structural separation of instruction from data is more robust than semantic filtering. The sanitise_input() method applies this:

[INST] {tool_description} [/INST]
[DATA] {user_argument_value} [/DATA]

This envelope is added before the argument value is injected into any LLM prompt. The LLM sees the structural boundary; injection attempts embedded in [DATA] are treated as data, not instructions.

Rate limiting configuration¶

# trails.toml
[security.rate_limits]
default_calls_per_minute = 60
default_calls_per_hour = 1000
burst_allowance = 10

[[security.rate_limits.overrides]]
principal_pattern = "did:key:*"        # glob-style
calls_per_minute = 120
calls_per_hour = 5000

[[security.rate_limits.overrides]]
tool_name = "trails://kg/write"       # write tools get stricter limits
calls_per_minute = 10

Integration with existing infrastructure¶

McpSecurityGuard is instantiated inside the MCP server startup in trails.mcp_server, threaded through the existing DispatchCoordinator request path.
Cedar scope checks call the existing PolicyEngine (ADR-0006) — no duplicate policy engine.
Audit logs integrate with OTel trace propagation (ADR-0071) — each audit entry carries the W3C traceparent from the active span.
ARGUS-style context monitoring (arXiv:2605.03378) is supported by passing context= (the accumulated agent conversation context) to check_request(), enabling detection of multi-turn injection attacks that span multiple tool calls.

CLI surface¶

trails security audit-log [--tail N] [--since ISO8601]
trails security threat-model [--format json|text]
trails security snapshot list
trails security snapshot verify <tool_name>

Non-goals¶

No reimplementation of Cedar. Scope checking calls the existing PolicyEngine.
No replacement of existing SSRF protection (already shipped). This module targets MCP-specific threats.
No client-side sandboxing. IsolateGPT-style subprocess/container isolation is deferred to M36 (IsolateGPT MCP Sandbox milestone).
No ML-based classifier for injection detection — regex + heuristics only in this milestone. ML classifier is a future enhancement.

Consequences¶

Positive¶

Structural defense. Tool-poisoning detection, rug-pull detection, and StruQ-style sanitisation are structural — they do not rely on LLM safety alignment, which MCPTox demonstrates is insufficient.
Auditability. Every tool call generates a tamper-evident audit log entry. Cedar policy scope is checked per-tool, not just per-capability.
OWASP coverage. All 10 OWASP MCP Top 10 threats are modelled in MCPThreatModel with Trails-specific severity and mitigation references.
Composable. McpSecurityGuard is a standalone class — testable in isolation, injectable into any MCP relay implementation, not coupled to specific LLM providers.

Negative¶

False positives in tool-poisoning scan. Regex-based heuristics will occasionally flag legitimate tool descriptions that contain instruction-like language. Configurable allowlist mitigates this.
Rate-limit coordination. Default in-memory rate limiting does not coordinate across multiple Trails processes. Multi-process deployments require the Redis backend configuration.
Snapshot storage. Rug-pull snapshots add a small file per registered tool. Snapshot pruning is manual (no auto-expiry in this milestone).

Non-consequences¶

ADR-0008 (MCP primary transport) unchanged. The relay protocol is unchanged; McpSecurityGuard sits in the existing request path.
ADR-0006 (Cedar) unchanged. Scope checking calls the existing PolicyEngine; the Cedar policy files gain tool-scope claims but the engine is unmodified.
ADR-0052 (Memory security) unchanged. Memory security gateway and MCP security guard are complementary layers.

Revisit conditions¶

If OWASP publishes an updated MCP Top 10, re-assess coverage and update MCPThreatModel.
If MCPTox or follow-on papers publish updated attack benchmarks, re-run heuristic evaluation against new attack patterns.
If Trails adopts a formal verification framework (e.g., Cedar analysis), integrate tool-scope verification into CI.
If IsolateGPT (M36) ships subprocess isolation, move the scope-enforcement outer ring to the container boundary.

References¶

Hou, Y., Zhao, W., Feng, X., Li, J., & Zhao, Y. (2026). MCPTox: Empirically Benchmarking LLM Agents Against Tool Poisoning Attacks. AAAI Conference on Artificial Intelligence 2026. arXiv:2508.14925.
OWASP Foundation. (2025). OWASP Top 10 for Large Language Model Applications: MCP Security Edition. https://owasp.org/www-project-top-10-for-large-language-model-applications/
Zou, W., Geng, R., Wang, B., & Jia, J. (2025). PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models. USENIX Security Symposium 2025. arXiv:2402.07867.
Chen, S., Piet, J., Sitawarin, C., & Wagner, D. (2025). StruQ: Defending Against Prompt Injection with Structured Queries. USENIX Security Symposium 2025.
Debenedetti, E., Khomenko, J., Küchler, M., Schönherr, L., & Tramèr, F. (2025). CaMeL: Defeating Prompt Injections by Design. arXiv:2503.18813.
Wu, J., Roesner, F., Kohno, T., & Raj, A. (2025). IsolateGPT: Execution Isolation for LLM-Based Agentic Systems. ISOC Network and Distributed System Security Symposium (NDSS 2025).
Radosevich, B., & Halloran, J. D. (2025). MCP Safety Audit: Assessing Security Risks in the Model Context Protocol Ecosystem. arXiv:2504.03767.
Hou, Y., Zhao, W., & Li, J. (2025). Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. Preprint.
Anonymous. (2025). ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection. arXiv:2605.03378.