Safety Alignment of LMs via Non-cooperative Games
Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, Arman Zharmagambetov

TL;DR
This paper frames safety alignment of language models as a non-cooperative game and trains the models with reinforcement learning on preference-based rewards, aiming to improve both robustness and utility.
Contribution
It trains an Attacker LM and a Defender LM jointly via online RL, rather than through sequential adversarial training, to improve both safety and usefulness.
Findings
The Defender LM becomes more helpful and more resilient to adversarial attacks.
The Attacker LM evolves into a strong red-teaming agent.
The method shifts the Pareto frontier of safety and utility.
Abstract
Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In…
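The abstract contrasts a preference-based reward, derived from pairwise comparisons, with point-wise scores. The sketch below is one way such a reward could be computed, not the AdvGame recipe itself: the names `PairwiseJudge`, `pairwise_reward`, and `toy_judge`, the use of baseline responses, and the rescaling to [-1, 1] are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical judge signature (an assumption, not from the paper):
# returns the probability that response `a` is preferred over `b` for `prompt`.
PairwiseJudge = Callable[[str, str, str], float]


def pairwise_reward(
    prompt: str,
    response: str,
    baselines: Sequence[str],
    judge: PairwiseJudge,
) -> float:
    """Average win probability of `response` against the baselines, rescaled to [-1, 1]."""
    if not baselines:
        raise ValueError("at least one baseline response is required")
    wins = [judge(prompt, response, baseline) for baseline in baselines]
    return 2.0 * sum(wins) / len(wins) - 1.0


def toy_judge(prompt: str, a: str, b: str) -> float:
    """Toy stand-in for a learned judge: prefers the longer response."""
    if len(a) > len(b):
        return 1.0
    if len(a) < len(b):
        return 0.0
    return 0.5
```

With the toy judge, `pairwise_reward("q", "abc", ["a", "ab"], toy_judge)` beats both baselines and gets the maximum reward, while a response that wins one comparison and loses one gets a reward of zero. In practice the judge would be a learned or prompted model rather than a length heuristic.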
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Topic Modeling
