TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Zhewen Tan; Wenhan Yu; Jianfeng Si; Tongxin Liu; Kaiqi Guan; Huiyan Jin; Jiawen Tao; Xiaokun Yuan; Duohe Ma; Xiangzheng Zhang; Tong Yang; Lin Sun

arXiv:2601.18292 · cs.LG · February 2, 2026

Open Access

TL;DR

TriPlay-RL introduces a closed-loop reinforcement learning framework that enhances safety alignment in large language models by enabling iterative collaboration among attacker, defender, and evaluator roles with minimal manual annotation.

Contribution

The paper presents a novel co-improving, multi-role RL framework for LLM safety alignment that reduces manual annotation effort and improves safety, robustness, and evaluation accuracy.

Findings

01. The attacker maintains high output diversity and improves adversarial effectiveness by 20-50%.

02. The defender improves safety performance by 10-30% without harming reasoning.

03. The evaluator enhances judgment accuracy, distinguishing unsafe, refusal, and useful responses.

Abstract

In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative and co-improving collaboration among three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling