AI Agents May Always Fall for Prompt Injections
Sahar Abdelnabi, Eugene Bagdasarian

TL;DR
Prompt injection is a critical vulnerability in AI agents, and current defenses are insufficient. A framework based on Contextual Integrity offers a principled way to evaluate and improve agent security.
Contribution
Reframes prompt injection attacks using Contextual Integrity theory, revealing fundamental limitations of existing defenses and proposing a new evaluation framework for context-sensitive failures.
Findings
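Current defenses based on data-instruction separation fail to detect contextual manipulation attacks and degrade contextually appropriate behavior.
The reframing suggests an impossibility result: an adversary can always construct a context under which a blocked flow appears legitimate.
A CI-based framework enables principled evaluation and alignment of AI agents.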
Abstract
Prompt injection is the most critical vulnerability in deployed AI agents. Despite recent progress, we show that the prevailing defense paradigm (data-instruction separation) both fails to detect attacks that operate through contextual manipulation and degrades contextually appropriate behavior. We then recast prompt injection through the lens of Contextual Integrity (CI), a privacy theory that judges whether information flows comply with contextual norms. This explains the types of attacks that current defenses attempt to patch and predicts the advanced ones that future agents will face. We develop unique benign and attack scenarios that force an agent to violate the norms by (1) misrepresenting the flow, (2) manipulating norms, or (3) mixing multiple flows. This reframing suggests an impossibility result: an adversary can always construct a context under which a blocked flow appears legitimate, or a…
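The CI idea in the abstract can be illustrated with a toy sketch. This is not the paper's framework: the `Flow` fields, the `NORMS` table, and the `agent_view` helper are all hypothetical names chosen for illustration. The sketch shows attack type (1), misrepresenting the flow: a norm check that passes on the flow as the agent perceives it can still approve a flow whose real recipient and context violate the norm.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Flow:
    """A simplified information flow (hypothetical schema, not from the paper)."""
    info_type: str   # what is being shared
    recipient: str   # where the data actually goes
    context: str     # context in which the flow actually occurs

# Hypothetical norms: (context, info_type, recipient) triples deemed appropriate.
NORMS = {
    ("medical_appointment", "health_record", "doctor"),
    ("travel_booking", "passport_number", "airline"),
}

def is_appropriate(flow: Flow) -> bool:
    """Return True if the flow, as described, matches a known norm."""
    return (flow.context, flow.info_type, flow.recipient) in NORMS

def agent_view(flow: Flow, injected_claims=None) -> Flow:
    """What the agent perceives: the true flow, with any attributes
    overridden by claims the attacker planted in the agent's input."""
    return replace(flow, **(injected_claims or {}))

benign = Flow("health_record", "doctor", "medical_appointment")
attack = Flow("health_record", "attacker.example", "web_browsing")

# Judged on its true attributes, the attack flow violates the norms.
truth_verdict = is_appropriate(attack)

# The injection misrepresents the recipient and the context.
perceived = agent_view(
    attack,
    {"recipient": "doctor", "context": "medical_appointment"},
)
perceived_verdict = is_appropriate(perceived)

print(is_appropriate(benign), truth_verdict, perceived_verdict)
```

In this toy setting the perceived attack flow is indistinguishable from the benign one, so any check that sees only the perceived attributes must treat them identically. That is the intuition behind the impossibility result the abstract describes, stated here only informally.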
