RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems
Yihong Tang, Bo Wang, Xu Wang, Dongming Zhao, Jing Liu, Jijun Zhang, Ruifang He, Yuexian Hou

TL;DR
This paper systematically analyzes character hallucination in role-playing LLM systems, identifying core mechanisms, evaluating mitigation techniques, and proposing a Narrator Mode to reduce hallucinations and improve role fidelity.
Contribution
It introduces the RoleBreak framework for analyzing hallucinations, creates the RoleBreakEval dataset, and proposes the Narrator Mode defense strategy to enhance role adherence.
Findings
Models enhanced to minimize hallucination remain vulnerable to hallucination attacks.
Narrator Mode significantly reduces hallucinations.
Narrator Mode improves role fidelity and coherence.
Abstract
Role-playing systems powered by large language models (LLMs) have become increasingly influential in emotional communication applications. However, these systems are susceptible to character hallucinations, where the model deviates from predefined character roles and generates responses that are inconsistent with the intended persona. This paper presents the first systematic analysis of character hallucination from an attack perspective, introducing the RoleBreak framework. Our framework identifies two core mechanisms, query sparsity and role-query conflict, as key factors driving character hallucination. Leveraging these insights, we construct a novel dataset, RoleBreakEval, to evaluate existing hallucination mitigation techniques. Our experiments reveal that even enhanced models trained to minimize hallucination remain vulnerable to attacks. To address these vulnerabilities, we propose…
Taxonomy
Topics: Psychopathy, Forensic Psychiatry, Sexual Offending · Stalking, Cyberstalking, and Harassment · Personality Disorders and Psychopathology
