The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)

Zihao Wang; Yibo Jiang; Jiahao Yu; Heqing Huang

arXiv:2505.00626·cs.CL·May 6, 2025

TL;DR

This paper investigates how large language models learn to distinguish between different input roles, revealing reliance on superficial cues, and proposes a method to reinforce invariant signals for more robust role separation.

Contribution

It identifies the shortcuts LLMs use for role distinction and introduces a position ID manipulation technique to improve role boundary recognition.

Findings

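01. Fine-tuned models exploit task type and proximity to begin-of-text as shortcuts for role identification.

02. Data augmentation yields only limited improvement: it tends to patch individual shortcuts rather than fix the underlying problem.

03. Adjusting position IDs helps the model learn role boundaries.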
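
As a rough illustration of what a position ID adjustment could look like, the sketch below builds position IDs for a system segment followed by a user segment and shifts the user segment by a fixed offset, so the role boundary shows up as a jump in the position sequence. This summary does not spell out the exact scheme the paper uses, so the function name `position_ids_with_gap`, the `gap` parameter, and the choice of a single constant offset are all assumptions made for illustration only.

```python
from typing import List


def position_ids_with_gap(
    num_system_tokens: int, num_user_tokens: int, gap: int = 0
) -> List[int]:
    """Build position IDs for [system tokens][user tokens].

    Hypothetical helper (not taken from the paper): system tokens get the
    usual positions 0..n-1, and user tokens start `gap` positions later
    than they normally would, marking the role boundary with a jump.
    """
    if min(num_system_tokens, num_user_tokens, gap) < 0:
        raise ValueError("token counts and gap must be non-negative")

    # System tokens keep their standard consecutive positions.
    system_ids = list(range(num_system_tokens))

    # User tokens continue after the system tokens, shifted by `gap`.
    user_start = num_system_tokens + gap
    user_ids = list(range(user_start, user_start + num_user_tokens))

    return system_ids + user_ids


if __name__ == "__main__":
    # gap=0 reproduces ordinary consecutive position IDs.
    print(position_ids_with_gap(3, 4, gap=0))   # [0, 1, 2, 3, 4, 5, 6]
    # gap=10 leaves a visible jump at the system/user boundary.
    print(position_ids_with_gap(3, 4, gap=10))  # [0, 1, 2, 13, 14, 15, 16]
```

With `gap=0` the result matches default position IDs, so the offset is the only thing that changes. In a real fine-tuning setup the resulting list would be passed to the model as its position IDs in place of the default ones; how that is wired up depends on the model and library in use.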

Abstract

Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role, a concept we call role separation, is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation…

Taxonomy

Methods: Activation Patching