Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
Haochuan Kevin Wang, Zechen Zhang

TL;DR
This paper introduces a stage-level tracking methodology for prompt injection in multi-agent LLM systems, revealing how different models and defenses perform across attack surfaces and informing deployment safety decisions.
Contribution
It presents a novel cryptographic token-based kill-chain canary approach to diagnose prompt injection vulnerabilities across multiple stages and defenses in production-like settings.
Findings
Write-node placement is crucial for safety; verified routing prevents propagation.
All defenses fail on at least one surface without adversarial adaptation.
Invisible payloads can match or surpass visible-text attack success rates.
Abstract
Multi-agent LLM systems are entering production -- processing documents, managing workflows, acting on behalf of users -- yet their resilience to prompt injection is still evaluated with a single binary: did the attack succeed? This leaves architects without the diagnostic information needed to harden real pipelines. We introduce a kill-chain canary methodology that tracks a cryptographic token through four stages (EXPOSED -> PERSISTED -> RELAYED -> EXECUTED) across 950 runs, five frontier LLMs, six attack surfaces, and five defense conditions. The results reframe prompt injection as a pipeline-architecture problem: every model is fully exposed, yet outcomes diverge downstream -- Claude blocks all injections at memory-write (0/164 ASR), GPT-4o-mini propagates at 53%, and DeepSeek exhibits 0%/100% across surfaces from the same model. Three findings matter for deployment: (1) write-node…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
