DRIP: Defending Prompt Injection via Token-wise Representation Editing and Residual Instruction Fusion
Ruofan Liu, Yun Lin, Zhiyong Huang, Jin Song Dong

TL;DR
DRIP introduces a novel method for defending large language models against prompt injection attacks by editing token representations and fusing residual instructions, significantly improving security while maintaining utility.
Contribution
It presents a new token-wise representation editing and residual instruction fusion approach that effectively prevents prompt injection without sacrificing model performance.
Findings
Reduces attack success rate by over 66% under adaptive attacks.
Improves role-separation score by 12-49%.
Maintains utility comparable to undefended models.
Abstract
Large language models (LLMs) are increasingly integrated into IT infrastructures, where they process user data according to predefined instructions. However, conventional LLMs remain vulnerable to prompt injection, where malicious users inject directive tokens into the data to subvert model behavior. Existing defenses train LLMs to semantically separate data and instruction tokens, but still struggle to (1) balance utility and security and (2) prevent instruction-like semantics in the data from overriding the intended instructions. We propose DRIP, which (1) precisely removes instruction semantics from tokens in the data section while preserving their data semantics, and (2) robustly preserves the effect of the intended instruction even under strong adversarial content. To "de-instructionalize" data tokens, DRIP introduces a data curation and training paradigm with a lightweight…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Security and Verification in Computing · Web Application Security Vulnerabilities
