Toward Optimal LLM Alignments Using Two-Player Games
Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, Xiaojun Xu, Zhaoran Wang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Hang Li, Yang Liu

TL;DR
This paper proposes a two-player game framework for aligning large language models, where adversarial and defensive agents iteratively improve model robustness and generalization beyond traditional prompt-based RLHF methods.
Contribution
It introduces a two-agent game approach to LLM alignment, shows theoretically that the iterative optimization converges to a Nash Equilibrium of the game, and reports improved robustness and generalization in safety scenarios.
Findings
Converges to Nash Equilibrium in the proposed game.
Enhances model robustness against adversarial prompts.
Improves generalization capabilities of LLMs.
Abstract
The standard Reinforcement Learning from Human Feedback (RLHF) framework primarily focuses on optimizing the performance of large language models using pre-collected prompts. However, collecting prompts that provide comprehensive coverage is both tedious and challenging, and often fails to include scenarios that LLMs need to improve on the most. In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. The adversarial agent's task at each step is to generate prompts that expose the weakness of the defensive agent. In return, the defensive agent seeks to improve its responses to these newly identified prompts it struggled with, based on feedback from the reward model. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the…
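The adversary/defender interaction described in the abstract can be illustrated with a toy zero-sum matrix game. The sketch below is not the paper's reinforcement learning procedure: it swaps in multiplicative-weights self-play, a standard method whose time-averaged strategies approach a Nash Equilibrium in two-player zero-sum games. The reward matrix `REWARD`, the function names, the step count, and the learning rate are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical reward matrix (an assumption, not data from the paper):
# REWARD[i][j] is the reward-model score when the adversary picks prompt
# type i and the defender answers with response strategy j.
REWARD = [[0.9, 0.2],
          [0.3, 0.8]]


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def self_play(reward, steps=5000, lr=0.05):
    """Multiplicative-weights self-play; returns time-averaged strategies."""
    n, m = len(reward), len(reward[0])
    adv = [1.0 / n] * n   # adversary's mix over prompt types
    dfn = [1.0 / m] * m   # defender's mix over response strategies
    adv_avg = [0.0] * n
    dfn_avg = [0.0] * m
    for _ in range(steps):
        # Record the strategies played this round in the running averages.
        adv_avg = [a + p / steps for a, p in zip(adv_avg, adv)]
        dfn_avg = [a + q / steps for a, q in zip(dfn_avg, dfn)]
        # Expected reward of each pure action against the opponent's mix.
        adv_payoff = [sum(reward[i][j] * dfn[j] for j in range(m)) for i in range(n)]
        dfn_payoff = [sum(reward[i][j] * adv[i] for i in range(n)) for j in range(m)]
        # Adversary shifts weight toward prompts where reward is low;
        # defender shifts weight toward responses where reward is high.
        adv = normalize([adv[i] * math.exp(-lr * adv_payoff[i]) for i in range(n)])
        dfn = normalize([dfn[j] * math.exp(lr * dfn_payoff[j]) for j in range(m)])
    return adv_avg, dfn_avg


def exploitability(reward, adv, dfn):
    """Duality gap: how much either player could gain by deviating."""
    n, m = len(reward), len(reward[0])
    best_defender = max(sum(adv[i] * reward[i][j] for i in range(n)) for j in range(m))
    best_adversary = min(sum(reward[i][j] * dfn[j] for j in range(m)) for i in range(n))
    return best_defender - best_adversary


if __name__ == "__main__":
    adv, dfn = self_play(REWARD)
    print("adversary mix:", adv)
    print("defender mix:", dfn)
    print("exploitability:", exploitability(REWARD, adv, dfn))
```

An exploitability near zero means neither player can improve much by deviating unilaterally, which is the defining property of a Nash Equilibrium. In the actual setting, both agents are language models trained with reinforcement learning against a reward model, so this matrix game only conveys the structure of the interaction.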
Taxonomy
Topics: Digital Rights Management and Security
