Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Hao Wang; Yanting Wang; Hao Li; Rui Li; Lei Sha

arXiv:2601.10589 · cs.CR · January 16, 2026

TL;DR

This paper introduces Safety Self-Play (SSP), enabling LLMs to autonomously generate and defend against adversarial attacks through self-play and reflective experience replay, leading to improved safety alignment.

Contribution

The paper presents a novel self-play framework with reflective experience replay that allows LLMs to autonomously evolve adversarial attacks and defenses, surpassing static dataset-based methods.

Findings

01. Outperforms baseline models trained on static adversarial datasets.

02. Evolves robust defense strategies through autonomous self-play.

03. Establishes a new benchmark for proactive safety in LLMs.

Abstract

Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial "jailbreak" attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self-Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover…
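The single-model, two-role loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated here and gives no algorithmic detail, so every name below (ToySelfPlayModel, judge, self_play_round), the zero-sum reward, the replay buffer of failed defenses, and the scalar "policy update" are assumptions chosen only to show the control flow. A real system would replace the toy model with an LLM and the scalar nudge with an RL policy update.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Episode:
    attack: str
    response: str
    defender_reward: float


@dataclass
class ToySelfPlayModel:
    """Hypothetical stand-in for one LLM that plays both roles via a role tag."""
    refusal_prob: float = 0.5  # toy scalar standing in for shared model weights
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def generate(self, role: str, prompt: str) -> str:
        if role == "attacker":
            # Placeholder rewrite; a real attacker would produce a new prompt.
            return f"variant-{self.rng.randint(0, 999)} of: {prompt}"
        # Defender role: refuse with probability refusal_prob.
        return "REFUSE" if self.rng.random() < self.refusal_prob else "COMPLY"


def judge(response: str) -> float:
    """Hypothetical safety judge: +1 if the defender refused, -1 otherwise."""
    return 1.0 if response == "REFUSE" else -1.0


def self_play_round(model, seed_request, replay_buffer, lr=0.1):
    """One attacker turn and one defender turn using the same model."""
    attack = model.generate("attacker", seed_request)
    response = model.generate("defender", attack)
    defender_reward = judge(response)
    attacker_reward = -defender_reward  # zero-sum reward is an assumption

    episode = Episode(attack, response, defender_reward)
    if defender_reward < 0:
        # Assumed replay policy: keep failed defenses for later review.
        replay_buffer.append(episode)
        # Toy update standing in for an RL step on the shared weights.
        model.refusal_prob = min(1.0, model.refusal_prob + lr)
    return episode, attacker_reward


if __name__ == "__main__":
    model = ToySelfPlayModel()
    buffer = []
    for _ in range(10):
        self_play_round(model, "placeholder request", buffer)
    print(len(buffer), model.refusal_prob)
```

The point of the sketch is the structure: both roles call the same model object, the attacker is rewarded when the defender fails, and failures are stored for reuse. How the paper actually defines rewards and reflective replay is not recoverable from the text above.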

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Explainable Artificial Intelligence (XAI)