Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus; Ilia Kulikov; Brandon Amos; Rémi Munos; Ivan Evtimov; Kamalika Chaudhuri; Arman Zharmagambetov

arXiv:2512.20806 · cs.AI · February 10, 2026

Open Access

TL;DR

This paper frames safety alignment of language models as a non-cooperative game between an Attacker LM and a Defender LM, trained jointly via online reinforcement learning with preference-based rewards to improve both robustness and utility.

Contribution

It introduces a joint training paradigm in which an Attacker LM and a Defender LM are trained together via online RL, improving both safety and usefulness relative to sequential adversarial training.

Findings

01. Defender LM becomes more helpful and more resilient to adversarial attacks.

02. Attacker LM evolves into a strong red-teaming agent.

03. Method shifts the safety-utility Pareto frontier.

Abstract

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In…
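To make the contrast between pairwise and point-wise supervision concrete, here is a minimal sketch of one way a preference-based reward could be computed. It is an illustration under stated assumptions, not the AdvGame recipe: the function names `pairwise_win_rates` and `toy_prefer` are hypothetical, and the toy judge (which simply prefers longer text) stands in for whatever preference model a real setup would use. Each response in a group is rewarded by its average probability of being preferred over the other responses, so no absolute score is ever assigned.

```python
from itertools import combinations
from typing import Callable, List


def pairwise_win_rates(
    responses: List[str],
    prefer: Callable[[str, str], float],
) -> List[float]:
    """Reward each response by its mean preference probability against the rest.

    `prefer(a, b)` is assumed to return P(a is preferred over b) in [0, 1].
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        p = prefer(responses[i], responses[j])
        totals[i] += p          # credit for i beating j
        totals[j] += 1.0 - p    # complementary credit for j
    # Each response took part in n - 1 comparisons.
    return [t / (n - 1) for t in totals]


def toy_prefer(a: str, b: str) -> float:
    """Hypothetical stand-in judge: prefers the longer string, ties split evenly."""
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) > len(b) else 0.0


if __name__ == "__main__":
    print(pairwise_win_rates(["a", "bbb", "cc"], toy_prefer))
```

Because the rewards in this sketch are relative within a group, they always sum to half the group size, which is one reason a pairwise signal can be less sensitive to a judge's absolute score calibration than a point-wise score.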

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Topic Modeling