Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Mickel Liu; Liwei Jiang; Yancheng Liang; Simon Shaolei Du; Yejin Choi; Tim Althoff; Natasha Jaques

arXiv:2506.07468·cs.LG·October 7, 2025

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques

PDF

1 Repo 3 Models 3 Reviews

TL;DR

This paper introduces Self-RedTeam, an online self-play reinforcement learning method where attacker and defender agents co-evolve to improve language model safety, achieving more diverse attacks and higher robustness through a game-theoretic framework.

Contribution

It presents a novel online self-play reinforcement learning approach for LM safety, with theoretical safety guarantees and empirical improvements over static defenses.

Findings

01

Uncovers 21.8% more diverse attacks compared to static defenders

02

Achieves 65.5% higher robustness on safety benchmarks

03

Introduces hidden Chain-of-Thought to enhance adversarial diversity

Abstract

Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 6Confidence 4

Strengths

1. The method is practical and efficient with a thorough analysis on the overhead of their approach (~45% longer than baseline with online generation). The framework is general and can be applied to any safety training pipeline, with a reasonable improvement to the refusal rate on the models tested. 2. The paper is very well written and polished; the authors conduct many experimental results, including comparisons to other safeguarding baselines (LAT, CircuitBreakers), as well as ablations on th

Weaknesses

1. If I understood correctly, the evaluations were done against *static* adversarial prompts (with the exception of X-teaming); stronger non-static attacks should be considered for the evaluations (i.e. applying some of the algorithmic methods to the final trained model itself, rather than using the preexisting attacks on other models). If the paper is indeed missing these evals, I would strongly recommend them for the discussion period. 2. Results indicate that the improvements are decent but n

Reviewer 02Rating 6Confidence 2

Strengths

Strengths: - Conceptually appealing formulation of LLM safety as a self-play MARL problem with a Nash equilibrium–based safety guarantee. - Solid empirical results across multiple model families (Llama, Qwen), demonstrating significant robustness gains (up to 95% ASR reduction) with minimal performance degradation. - The Hidden CoT mechanism is an elegant addition, improving attack diversity and mitigating over-refusal.

Weaknesses

Weaknesses: - The theoretical guarantee relies heavily on the quality of the reward model; practical convergence to Nash equilibrium is not verified. - Some evaluation benchmarks (e.g., WildGuard/WildJailBreak) overlap with training data, potentially inflating results. - Experimental section could be more transparent about compute cost and stability during training.

Reviewer 03Rating 4Confidence 3

Strengths

- It formulates red-teaming as a two-player zero-sum game with a formal safety guarantee at Nash Equilibrium. - It shows strong empirical results, showing consistent gains across 12 benchmark and multiple model families and sizes. - Extensive ablations show the effectiveness of each proposed components.

Weaknesses

- Reward model and policy (defender and attacker) share the same parameter $\theta$, which looks confusing. Given that the WildGuard is used for reward model, it must be a notational mistake. I think it would be better to use different parameters and explicitly state the reward model is frozen during entire training. - The KL term in Eq. is undefined. I guess the authors might use token-wise reverse KL, but it would better to explicitly define the term for clarity. - There is no direct head

Code & Models

Repositories

mickelliu/selfplay-redteaming
pytorchOfficial

Models

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

MethodsActivation Patching