Simple Role Assignment is Extraordinarily Effective for Safety Alignment

Zhou Ziheng; Jiakun Ding; Zhaowei Zhang; Ruosen Gao; Yingnian Wu; Demetri Terzopoulos; Yipeng Kang; Fangwei Zhong; Junqi Wang

arXiv:2602.00061 · cs.CY · February 3, 2026

Open Access

TL;DR

This paper introduces role conditioning as a simple, effective method for safety alignment in AI, outperforming traditional principle-based approaches and significantly reducing unsafe outputs across multiple benchmarks.

Contribution

It proposes a training-free, role-conditioned generation and critique pipeline grounded in Theory of Mind, demonstrating superior safety performance in large language models.

Findings

01. Reduces unsafe outputs on WildJailbreak from 81.4% to 3.6% (with DeepSeek-V3)

02. Outperforms principle-based and Chain-of-Thought baselines

03. Consistently effective across five model families

Abstract

Principle-based alignment often lacks context sensitivity and completeness. Grounded in Theory of Mind, we propose role conditioning as a compact alternative: social roles (e.g., mother, judge) implicitly encode both values and the cognitive schemas required to apply them. We introduce a training-free pipeline featuring a role-conditioned generator and iterative role-based critics for refinement. Across five model families, our approach consistently outperforms principle-based, Chain-of-Thought (CoT), and other baselines on multiple benchmarks. Notably, with DeepSeek-V3 it reduces unsafe outputs on the WildJailbreak benchmark from 81.4% to 3.6%. Beyond common safety benchmarks, it also applies consistently to agentic safety tasks. These results establish role assignment as a powerful, interpretable paradigm for AI alignment and LLM-as-a-Judge construction.
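The abstract describes the pipeline only at a high level: a role-conditioned generator followed by iterative role-based critics. The following is a minimal sketch of one way such a loop could be wired together, not the authors' implementation. The prompt wording, the "OK" acceptance convention, the `max_rounds` limit, and all function names are assumptions made for illustration; `fake_llm` is a hypothetical stand-in for a real model call so the example runs offline.

```python
from typing import Callable, List

# Assumption: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def role_prompt(role: str, task: str) -> str:
    """Prefix a task with a role assignment (wording is illustrative only)."""
    return f"You are a {role}. {task}"


def generate_with_role_critics(
    llm: LLM,
    user_request: str,
    generator_role: str,
    critic_roles: List[str],
    max_rounds: int = 2,
) -> str:
    """Draft a response under one role, then refine it using role-based critics."""
    # Step 1: role-conditioned generation.
    draft = llm(role_prompt(generator_role, f"Respond to the request:\n{user_request}"))

    for _ in range(max_rounds):
        # Step 2: each critic role reviews the current draft.
        objections = []
        for role in critic_roles:
            verdict = llm(role_prompt(
                role,
                "Review the response below. Reply 'OK' if it is acceptable, "
                "otherwise state your concern.\n"
                f"Request: {user_request}\nResponse: {draft}",
            ))
            if verdict.strip().upper() != "OK":
                objections.append(f"{role}: {verdict.strip()}")

        # Stop early once no critic objects.
        if not objections:
            break

        # Step 3: the generator revises its draft in light of the objections.
        feedback = "\n".join(objections)
        draft = llm(role_prompt(
            generator_role,
            f"Revise your response to address these concerns:\n{feedback}\n"
            f"Request: {user_request}\nPrevious response: {draft}",
        ))
    return draft


def fake_llm(prompt: str) -> str:
    """Hypothetical deterministic stub standing in for a real model call."""
    if "Review the response" in prompt:
        return "OK" if "[revised]" in prompt else "The draft needs a safer framing."
    if "Revise your response" in prompt:
        return "[revised] safe answer"
    return "first draft"


result = generate_with_role_critics(fake_llm, "example request", "mother", ["judge"])
print(result)
```

With the stub, the critic rejects the first draft and accepts the revision, so the loop ends after one refinement. Swapping `fake_llm` for a real model client leaves the control flow unchanged, which fits the training-free framing: only prompts change, not model weights.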

Taxonomy

Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Topic Modeling