TL;DR
This paper introduces Self-RedTeam, an online self-play reinforcement learning method where attacker and defender agents co-evolve to improve language model safety, achieving more diverse attacks and higher robustness through a game-theoretic framework.
Contribution
It presents a novel online self-play reinforcement learning approach for LM safety, with theoretical safety guarantees and empirical improvements over static defenses.
Findings
Uncovers 21.8% more diverse attacks than attackers trained against static defenders
Achieves up to 65.5% higher robustness on safety benchmarks than defenders trained against static attackers
Introduces hidden Chain-of-Thought to enhance adversarial diversity and reduce over-refusal
Abstract
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates…
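A minimal sketch of one self-play round as the abstract describes it: a single shared policy plays both roles, and a judge assigns zero-sum rewards. The names `toy_policy`, `toy_reward_lm`, and `self_play_round`, the canned responses, and the ±1 reward scale are all hypothetical stand-ins, not the paper's implementation; the actual reward design may include further terms.

```python
import random
from dataclasses import dataclass


@dataclass
class Transition:
    role: str       # "attacker" or "defender"
    prompt: str     # what the role was conditioned on
    response: str   # what the role generated
    reward: float   # scalar reward used for the RL update


def toy_policy(role: str, context: str, rng: random.Random) -> str:
    """Hypothetical stand-in for the single shared LM, conditioned on a role."""
    if role == "attacker":
        return f"adversarial rewrite of: {context}"
    # Defender: randomly refuses or complies (a real LM would generate text).
    return rng.choice(["I can't help with that.", "Sure, here is how..."])


def toy_reward_lm(prompt: str, response: str) -> float:
    """Hypothetical frozen judge: +1 if the defender's reply is safe, else -1."""
    return 1.0 if response.startswith("I can't") else -1.0


def self_play_round(seed_prompt: str, rng: random.Random) -> list[Transition]:
    # The same policy first acts as attacker, then as defender.
    attack = toy_policy("attacker", seed_prompt, rng)
    reply = toy_policy("defender", attack, rng)

    # Zero-sum assumption: the attacker gains exactly what the defender loses.
    defender_reward = toy_reward_lm(attack, reply)
    attacker_reward = -defender_reward

    return [
        Transition("attacker", seed_prompt, attack, attacker_reward),
        Transition("defender", attack, reply, defender_reward),
    ]


if __name__ == "__main__":
    for t in self_play_round("a harmful seed request", random.Random(0)):
        print(t.role, t.reward)
```

Both transitions would feed the same policy-gradient update, which is what lets the attacker and defender co-adapt online rather than in separate stages.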
Peer Reviews
Decision: Submitted to ICLR 2026
1. The method is practical and efficient, with a thorough analysis of its overhead (~45% longer than the baseline with online generation). The framework is general and can be applied to any safety training pipeline, with a reasonable improvement in the refusal rate on the models tested.
2. The paper is very well written and polished; the authors report many experiments, including comparisons to other safeguarding baselines (LAT, CircuitBreakers), as well as ablations on th
1. If I understood correctly, the evaluations were done against *static* adversarial prompts (with the exception of X-teaming); stronger non-static attacks should be considered for the evaluations (i.e., applying some of the algorithmic methods to the final trained model itself, rather than using preexisting attacks on other models). If the paper is indeed missing these evals, I would strongly recommend them for the discussion period.
2. Results indicate that the improvements are decent but n
Strengths:
- Conceptually appealing formulation of LLM safety as a self-play MARL problem with a Nash equilibrium–based safety guarantee.
- Solid empirical results across multiple model families (Llama, Qwen), demonstrating significant robustness gains (up to 95% ASR reduction) with minimal performance degradation.
- The Hidden CoT mechanism is an elegant addition, improving attack diversity and mitigating over-refusal.
Weaknesses:
- The theoretical guarantee relies heavily on the quality of the reward model; practical convergence to Nash equilibrium is not verified.
- Some evaluation benchmarks (e.g., WildGuard/WildJailBreak) overlap with training data, potentially inflating results.
- Experimental section could be more transparent about compute cost and stability during training.
- It formulates red-teaming as a two-player zero-sum game with a formal safety guarantee at Nash equilibrium.
- It reports strong empirical results, with consistent gains across 12 benchmarks and multiple model families and sizes.
- Extensive ablations show the effectiveness of each proposed component.
- The reward model and the policy (defender and attacker) share the same parameter $\theta$, which is confusing. Given that WildGuard is used as the reward model, this must be a notational mistake. It would be better to use different parameters and to state explicitly that the reward model is frozen throughout training.
- The KL term in Eq. is undefined. I guess the authors might use token-wise reverse KL, but it would be better to define the term explicitly for clarity.
- There is no direct head
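For reference, the token-wise reverse KL that the reviewer guesses at is commonly estimated from samples of the current policy as shown below. This is a sketch of the standard estimator, not the paper's definition: the reference policy $\pi_{\mathrm{ref}}$, the response length $T$, and the $1/T$ normalization are assumptions introduced here for illustration.

```latex
% Assumed notation: x is the prompt, y = (y_1, ..., y_T) a response sampled
% from the current policy \pi_\theta, and \pi_ref a frozen reference policy.
\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
  \approx \frac{1}{T} \sum_{t=1}^{T}
  \left[ \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)
       - \log \pi_{\mathrm{ref}}\!\left(y_t \mid x, y_{<t}\right) \right],
  \qquad y \sim \pi_\theta(\cdot \mid x)
```

It is "reverse" because the expectation is taken under $\pi_\theta$ rather than under $\pi_{\mathrm{ref}}$, which penalizes the policy for placing mass where the reference assigns little.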
Taxonomy
Methods: Activation Patching
