Prompt Injection as Role Confusion
Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

TL;DR
This paper attributes prompt injection vulnerabilities in language models to role confusion: models infer the source of text from how it sounds rather than from where it actually comes from, so attacker-written text that sounds like a trusted source is treated as one.
Contribution
It introduces role probes, which measure how a model internally perceives who is speaking, and a framework that explains prompt injection as a consequence of how models represent roles.
Findings
CoT Forgery, an attack that exploits role confusion, reaches a 60% attack success rate on StrongREJECT
Role confusion correlates strongly with attack success
The analysis generalizes to standard agent prompt injections, placing them under a single unifying framework
Abstract
Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer the source of text based on how it sounds, not where it actually comes from. A command hidden in a webpage hijacks an agent simply because it sounds like a user instruction. This is not just behavioral: in the model's internal representations, text that sounds like a trusted source occupies the same space as text that actually is one. We design role probes which measure how models internally perceive "who is speaking", showing that attacker-controllable signals (e.g. syntactic patterns, lexical choice) control role perception. We first test this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts or ingested webpages. Models mistake the text for their own thoughts, yielding 60% attack success on…
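To make the idea of a role probe concrete, here is a minimal sketch, assuming a probe is a linear classifier trained on hidden states to predict which role the text came from. This page does not specify the authors' probe design, so the logistic-regression form, the function names (`train_role_probe`, `user_probability`), and the single "role direction" are all hypothetical. The hidden states are synthetic random vectors, not activations from any real model, and nothing here reproduces the paper's results.

```python
import numpy as np


def train_role_probe(states, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe for P(role = user | hidden state).

    Assumption: a linear probe; the paper's actual probe may differ.
    """
    n, d = states.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(states @ w + b)))
        grad = probs - labels  # gradient of the log-loss w.r.t. the logits
        w -= lr * (states.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def user_probability(w, b, states):
    """Probability the probe assigns to the 'user' role."""
    return 1.0 / (1.0 + np.exp(-(states @ w + b)))


# Synthetic stand-ins for hidden states (assumption: roles differ along one axis).
rng = np.random.default_rng(0)
dim = 16
role_direction = np.zeros(dim)
role_direction[0] = 1.0

user_states = rng.normal(size=(200, dim)) + 2.0 * role_direction  # real user turns
tool_states = rng.normal(size=(200, dim)) - 2.0 * role_direction  # webpage/tool text
states = np.vstack([user_states, tool_states])
labels = np.concatenate([np.ones(200), np.zeros(200)])

w, b = train_role_probe(states, labels)
accuracy = ((user_probability(w, b, states) > 0.5) == labels).mean()

# Hypothetical injected text: it arrives through a webpage, but its hidden
# state sits where user text sits, so the probe reads it as user-authored.
spoofed_state = 2.0 * role_direction
spoofed_user_prob = user_probability(w, b, spoofed_state)

print(f"probe accuracy on synthetic data: {accuracy:.2f}")
print(f"P(user) for spoofed webpage text: {spoofed_user_prob:.2f}")
```

In this toy setup the probe only sees the representation, never the channel the text arrived on, which is the sense in which text that sounds like a trusted source can occupy the same space as text that actually is one.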
