The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Gabriele La Malfa, Emanuele La Malfa, Saar Cohen, Jie M. Zhang, Michael Luck, Michael Wooldridge, Elizabeth Black

TL;DR
This paper introduces Anchored Bipolicy Self-Play, a novel method that trains separate role-specific adapters on a frozen base model to improve AI safety and adversarial robustness, overcoming limitations of traditional self-play.
Contribution
It proposes a new self-play approach with role separation using adapters, achieving greater efficiency and safety improvements over standard self-play methods.
Findings
Up to 100x greater parameter efficiency than fine-tuning.
Consistent safety improvements over self-play fine-tuned models.
Enhanced robustness in safety benchmarks and adversarial defense.
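For intuition about why training adapters on a frozen base model needs far fewer trainable parameters than fine-tuning, the sketch below counts parameters for a single weight matrix. It assumes a low-rank adapter (two thin matrices per role); the listing does not say which adapter type the paper uses, and the dimension and rank are hypothetical values chosen only for illustration.

```python
def adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one low-rank adapter:
    a (d_in x rank) matrix plus a (rank x d_out) matrix."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole (d_in x d_out) matrix is updated."""
    return d_in * d_out


# Hypothetical sizes, not taken from the paper.
D = 4096
RANK = 16

per_role = adapter_params(D, D, RANK)   # one adapter (attacker or defender)
both_roles = 2 * per_role               # separate adapters for the two roles
full = full_params(D, D)                # updating the base matrix directly
ratio = full / both_roles

print(f"per-role adapter: {per_role} parameters")
print(f"both roles:       {both_roles} parameters")
print(f"full matrix:      {full} parameters")
print(f"full / adapters:  {ratio:.0f}x")
```

Under these assumed sizes, the two adapters together train 64 times fewer parameters than a full update of the same matrix. The actual ratio depends on the adapter design, the rank, and which layers are adapted.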
Abstract
Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., where the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by the use of the same model for the two roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of Nash equilibria that can be reached corresponds to a broad class of behaviours that includes trivial always-refuse strategies and oracle-like defenders, thus limiting practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks…
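The point about trivial equilibria can be illustrated with a toy two-by-two zero-sum game. This is a minimal sketch, not the paper's formal game: the action names and the payoff matrix are hypothetical, and the defender's payoff is assumed to score safety only (1 if the response is safe, 0 if the defender is jailbroken), with the attacker receiving the negative.

```python
from itertools import product

# Hypothetical actions for a toy game.
DEFENDER_ACTIONS = ["always_refuse", "answer_helpfully"]
ATTACKER_ACTIONS = ["benign_prompt", "jailbreak_prompt"]

# Defender payoff U[d][a]; the attacker's payoff is -U[d][a] (zero-sum).
# Assumption: the payoff rewards safety only, so refusing is always "safe".
U = [
    [1, 1],  # always_refuse: safe against both prompts
    [1, 0],  # answer_helpfully: safe on benign, jailbroken otherwise
]


def is_pure_nash(payoff, d, a):
    """True if neither player gains by deviating alone from (d, a)."""
    # The defender maximises payoff over rows, holding the column fixed.
    defender_ok = all(payoff[d][a] >= payoff[alt][a] for alt in range(len(payoff)))
    # The attacker minimises the defender's payoff over columns, holding the row fixed.
    attacker_ok = all(payoff[d][a] <= payoff[d][alt] for alt in range(len(payoff[d])))
    return defender_ok and attacker_ok


equilibria = [
    (DEFENDER_ACTIONS[d], ATTACKER_ACTIONS[a])
    for d, a in product(range(2), range(2))
    if is_pure_nash(U, d, a)
]
print(equilibria)
```

In this toy game both pure equilibria have the defender playing always_refuse, which is safe but unhelpful. A payoff that scores safety alone cannot distinguish this from a defender that is both safe and useful.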
