Bypassing Prompt Injection Detectors through Evasive Injections
Md Jahedur Rahman, Ihsen Alouani

TL;DR
This paper demonstrates that prompt injection detectors based on activation shifts are vulnerable to adaptive evasion attacks using optimized suffixes, and proposes a robust defense via adversarial suffix augmentation.
Contribution
The paper introduces a multi-probe evasion attack that bypasses existing activation-shift-based prompt injection detectors, and proposes a defense based on adversarial suffix augmentation.
Findings
A single universal suffix evades the detectors with a success rate above 93%.
Detectors based on activation shifts are highly vulnerable to adaptive attacks.
Adversarial suffix augmentation improves detector robustness.
Abstract
Large language models (LLMs) are increasingly used in interactive and retrieval-augmented systems, but they remain vulnerable to prompt injection attacks, where injected secondary prompts force the model to deviate from the user's instructions to execute a potentially malicious task defined by the adversary. Recent work shows that ML models trained on activation shifts from LLMs' hidden layers can detect such drift. In this paper, we demonstrate that these detectors are not robust to adaptive adversaries. We propose a multi-probe evasion attack that appends an adversarially optimised suffix to poisoned inputs, jointly optimising a universal suffix to simultaneously fool all layer-wise drift detectors while preserving the effectiveness of the underlying injection. Using a modified Greedy Coordinate Gradient (GCG) approach, we generate universal suffixes that make prompt injections…
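To make the "jointly fool all layer-wise detectors" idea concrete, here is a minimal sketch of how candidate suffixes might be scored against several probes at once. It is not the authors' implementation: it assumes each detector is a logistic-regression probe `(w, b)` over one layer's activation-shift vector, and the names `probe_scores`, `joint_evasion_loss`, and `pick_best_suffix` are hypothetical. The gradient-based candidate proposal of GCG and the term that preserves the injection's effectiveness are both omitted; only the joint scoring step is shown.

```python
import numpy as np

def probe_scores(deltas, probes):
    """Probability that each layer-wise probe assigns to the 'injected' class.

    deltas: list of per-layer activation-shift vectors (assumed precomputed).
    probes: list of (w, b) logistic-regression parameters, one per layer.
    """
    return np.array([1.0 / (1.0 + np.exp(-(w @ d + b)))
                     for d, (w, b) in zip(deltas, probes)])

def joint_evasion_loss(deltas, probes, eps=1e-12):
    """Mean cross-entropy toward the 'clean' label (0) over all probes.

    The loss is low only when every probe is fooled at the same time.
    """
    p = probe_scores(deltas, probes)
    return float(np.mean(-np.log(1.0 - p + eps)))

def pick_best_suffix(candidates, probes):
    """Greedy selection: keep the candidate suffix with the lowest joint loss.

    candidates: dict mapping a suffix string to its list of per-layer shifts.
    """
    return min(candidates,
               key=lambda s: joint_evasion_loss(candidates[s], probes))
```

Averaging over probes, rather than targeting one layer, penalises a suffix that slips past one detector while still triggering another.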
