Simple Role Assignment is Extraordinarily Effective for Safety Alignment
Zhou Ziheng, Jiakun Ding, Zhaowei Zhang, Ruosen Gao, Yingnian Wu, Demetri Terzopoulos, Yipeng Kang, Fangwei Zhong, Junqi Wang

TL;DR
This paper introduces role conditioning as a simple, effective method for safety alignment in AI, outperforming traditional principle-based approaches and significantly reducing unsafe outputs across multiple benchmarks.
Contribution
It proposes a training-free, role-conditioned generation and critique pipeline grounded in Theory of Mind, demonstrating superior safety performance in large language models.
Findings
Reduces unsafe outputs on WildJailbreak from 81.4% to 3.6% with DeepSeek-V3
Outperforms principle-based and Chain-of-Thought baselines
Consistently effective across five model families
Abstract
Principle-based alignment often lacks context sensitivity and completeness. Grounded in Theory of Mind, we propose role conditioning as a compact alternative: social roles (e.g., mother, judge) implicitly encode both values and the cognitive schemas required to apply them. We introduce a training-free pipeline featuring a role-conditioned generator and iterative role-based critics for refinement. Across five model families and multiple benchmarks, our approach consistently outperforms principle-based, Chain-of-Thought (CoT), and other baselines. Notably, it reduces unsafe outputs on the WildJailbreak benchmark from 81.4% to 3.6% with DeepSeek-V3. Beyond common safety benchmarks, it also applies consistently to agentic safety tasks. These results establish role assignment as a powerful, interpretable paradigm for AI alignment and LLM-as-a-Judge construction.
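The abstract describes the pipeline only at a high level: a role-conditioned generator, followed by iterative role-based critics that refine the draft. The sketch below shows one way such a loop could be wired together. It is not the authors' implementation. The `llm` callable, the function names, the prompt wording, the "OK" acceptance convention, and the `max_rounds` limit are all assumptions made for illustration. Only the example roles (mother, judge) come from the abstract.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def role_prompt(role: str, task: str) -> str:
    """Prefix a task with a role assignment (prompt wording is an assumption)."""
    return f"You are a {role}.\n\n{task}"


def generate_with_role_critics(
    llm: LLM,
    user_request: str,
    generator_role: str,
    critic_roles: List[str],
    max_rounds: int = 2,
) -> str:
    """Draft a response under one role, then refine it using role-based critics."""
    # Step 1: role-conditioned generation.
    draft = llm(role_prompt(
        generator_role,
        f"Respond to the following request:\n{user_request}",
    ))

    for _ in range(max_rounds):
        # Step 2: each critic role reviews the current draft.
        critiques = []
        for role in critic_roles:
            verdict = llm(role_prompt(
                role,
                f"Request:\n{user_request}\n\n"
                f"Draft response:\n{draft}\n\n"
                "Reply 'OK' if the draft is acceptable from your perspective; "
                "otherwise explain what is wrong.",
            ))
            if verdict.strip().upper() != "OK":
                critiques.append(f"{role}: {verdict.strip()}")

        # Stop as soon as every critic accepts the draft.
        if not critiques:
            break

        # Step 3: the generator revises the draft in light of the critiques.
        critique_text = "\n".join(critiques)
        draft = llm(role_prompt(
            generator_role,
            f"Request:\n{user_request}\n\n"
            f"Draft response:\n{draft}\n\n"
            f"Critiques:\n{critique_text}\n\n"
            "Rewrite the response to address the critiques.",
        ))

    return draft
```

Because no training is involved, the whole loop reduces to prompt construction around an existing model. A call might look like `generate_with_role_critics(my_llm, request, "mother", ["judge"])`, where `my_llm` is whatever client wrapper is available.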
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Topic Modeling
