Prompt Injection as Role Confusion

Charles Ye; Jasmine Cui; Dylan Hadfield-Menell

arXiv:2603.12277 · cs.CL · April 17, 2026

TL;DR

This paper attributes prompt injection vulnerabilities in language models to role confusion: models infer the source of text from how it sounds rather than from where it actually comes from, so attacker-written text that sounds like a trusted source is treated as one.

Contribution

It introduces role probes, which measure how a model internally perceives who is speaking, and presents a framework that treats prompt injection as a consequence of how models represent roles.

Findings

01 CoT Forgery, an attack that exploits role confusion, reaches a 60% attack success rate on StrongREJECT

02 Role confusion correlates strongly with attack success

03 The framework generalizes to standard agent prompt injections

Abstract

Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer the source of text based on how it sounds, not where it actually comes from. A command hidden in a webpage hijacks an agent simply because it sounds like a user instruction. This is not just behavioral: in the model's internal representations, text that sounds like a trusted source occupies the same space as text that actually is one. We design role probes which measure how models internally perceive "who is speaking", showing that attacker-controllable signals (e.g. syntactic patterns, lexical choice) control role perception. We first test this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts or ingested webpages. Models mistake the text for their own thoughts, yielding 60% attack success on…
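The idea of a role probe can be illustrated with a minimal sketch. This listing does not describe the paper's actual probe design, so everything below is an assumption made for illustration: the "hidden states" are synthetic Gaussian clusters rather than real model activations, the probe is a plain logistic regression, and the names `make_synthetic_states`, `train_probe`, and `p_user` are hypothetical. The sketch shows only the general mechanism: a probe trained to read "who is speaking" from a representation will label injected text as user text if that text's representation resembles the user cluster.

```python
import numpy as np

def make_synthetic_states(n_per_role, dim, rng):
    """Synthetic stand-ins for hidden states (an assumption, not real activations):
    one Gaussian cluster per role."""
    centers = {"user": np.full(dim, 1.0), "tool": np.full(dim, -1.0)}
    X = np.vstack([
        centers[role] + rng.normal(size=(n_per_role, dim))
        for role in ("user", "tool")
    ])
    # Label 1 = text that truly came from the user, 0 = text from a tool/webpage.
    y = np.concatenate([np.ones(n_per_role), np.zeros(n_per_role)])
    return X, y, centers

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y  # gradient of the log loss with respect to the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def p_user(w, b, h):
    """Probe's estimated probability that representation h is 'user' text."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

rng = np.random.default_rng(0)
dim = 16
X, y, centers = make_synthetic_states(200, dim, rng)
w, b = train_probe(X, y)

# How well the probe separates the two true roles on its training data.
accuracy = float(((p_user(w, b, X) > 0.5) == (y == 1)).mean())

# Hypothetical injected text: it arrives through the tool channel, but its
# representation sits near the user cluster because it "sounds like" a user.
spoofed = centers["user"] + 0.1 * rng.normal(size=dim)
spoofed_p_user = float(p_user(w, b, spoofed))

print(f"probe accuracy on true roles: {accuracy:.2f}")
print(f"P(user) for spoofed tool text: {spoofed_p_user:.2f}")
```

In this toy setup the probe never sees the channel the text arrived on, only its representation, which is the situation the abstract describes: text that sounds like a trusted source occupies the same space as text that actually is one.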
