TL;DR
This paper investigates how large language models learn to distinguish between different input roles, revealing reliance on superficial cues, and proposes a method to reinforce invariant signals for more robust role separation.
Contribution
It identifies the shortcuts LLMs use for role distinction and introduces a position ID manipulation technique to improve role boundary recognition.
Findings
Fine-tuned models identify roles through two shortcuts: the task type and a token's proximity to begin-of-text.
Data augmentation offers only limited improvement and amounts to patching individual shortcuts.
Position ID adjustments enhance role boundary learning.
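As an illustration of what a position ID adjustment could look like, the sketch below inserts a numeric gap between the position IDs of system tokens and those of user tokens, so the role boundary shows up in the positional signal itself. This is a minimal sketch under stated assumptions: the function name `build_position_ids`, the `gap` parameter, and the choice of a fixed offset are hypothetical and are not taken from the paper, which this summary does not describe in enough detail to reproduce.

```python
from typing import List


def build_position_ids(
    num_system_tokens: int,
    num_user_tokens: int,
    gap: int = 0,
) -> List[int]:
    """Return position IDs for a [system tokens][user tokens] sequence.

    Hypothetical helper: `gap` is an assumed fixed offset inserted between
    the last system position and the first user position. With gap=0 the
    result is the ordinary contiguous numbering 0..n-1.
    """
    if min(num_system_tokens, num_user_tokens, gap) < 0:
        raise ValueError("token counts and gap must be non-negative")

    # System tokens keep their usual positions, starting at 0.
    system_ids = list(range(num_system_tokens))

    # User tokens start `gap` positions later than they normally would.
    user_start = num_system_tokens + gap
    user_ids = list(range(user_start, user_start + num_user_tokens))

    return system_ids + user_ids


if __name__ == "__main__":
    print(build_position_ids(3, 4, gap=0))   # [0, 1, 2, 3, 4, 5, 6]
    print(build_position_ids(3, 4, gap=10))  # [0, 1, 2, 13, 14, 15, 16]
```

In practice a list like this would be converted to a tensor and passed to a model that accepts explicit position IDs; whether a given gap size helps is an empirical question that this sketch does not address.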
Abstract
Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role, a concept we call role separation, is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation…
Taxonomy
Methods: Activation Patching
