Toward Optimal LLM Alignments Using Two-Player Games

Rui Zheng; Hongyi Guo; Zhihan Liu; Xiaoying Zhang; Yuanshun Yao; Xiaojun Xu; Zhaoran Wang; Zhiheng Xi; Tao Gui; Qi Zhang; Xuanjing Huang; Hang Li; Yang Liu

arXiv:2406.10977 · cs.CL · June 18, 2024

TL;DR

This paper proposes a two-player game framework for aligning large language models: an adversarial agent generates prompts that expose the weaknesses of a defensive agent, and the defensive agent iteratively improves its responses to them, yielding better robustness and generalization than standard RLHF on pre-collected prompts.

Contribution

The paper introduces a two-agent game approach to LLM alignment, shows theoretically that the iterative optimization converges to a Nash Equilibrium, and reports improved robustness and generalization in safety scenarios.

Findings

01 The iterative optimization converges to a Nash Equilibrium of the proposed game.

02 The approach enhances model robustness against adversarial prompts.

03 The approach improves the generalization capabilities of LLMs.

Abstract

The standard Reinforcement Learning from Human Feedback (RLHF) framework primarily focuses on optimizing the performance of large language models using pre-collected prompts. However, collecting prompts that provide comprehensive coverage is both tedious and challenging, and often fails to include scenarios that LLMs need to improve on the most. In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. The adversarial agent's task at each step is to generate prompts that expose the weakness of the defensive agent. In return, the defensive agent seeks to improve its responses to these newly identified prompts it struggled with, based on feedback from the reward model. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the…
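The adversary-versus-defender loop described in the abstract can be illustrated with a toy zero-sum matrix game. This is only a sketch under stated assumptions, not the paper's training procedure: the 2x2 reward table `REWARD` is hypothetical, the two "agents" are probability distributions rather than LLMs, and the multiplicative-weights (Hedge) update stands in for the reinforcement learning updates used in the paper. The adversary shifts weight toward prompt types with low reward, the defender shifts weight toward responses with high reward, and the time-averaged strategies approach the game's Nash equilibrium.

```python
import math

# Hypothetical reward table: REWARD[p][r] is the reward the defender earns
# when prompt type p meets response strategy r. Values are made up.
REWARD = [[0.9, 0.1],
          [0.2, 0.6]]

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def play_game(reward, rounds=5000, lr=0.05):
    """Run Hedge updates for both players; return time-averaged strategies."""
    n_prompts, n_resps = len(reward), len(reward[0])
    adv_logits, def_logits = [0.0] * n_prompts, [0.0] * n_resps
    adv_sum, def_sum = [0.0] * n_prompts, [0.0] * n_resps
    for _ in range(rounds):
        adv, dfn = softmax(adv_logits), softmax(def_logits)
        # Expected reward of each prompt type against the current defender.
        prompt_reward = [sum(reward[p][r] * dfn[r] for r in range(n_resps))
                         for p in range(n_prompts)]
        # Expected reward of each response strategy against the current adversary.
        resp_reward = [sum(reward[p][r] * adv[p] for p in range(n_prompts))
                       for r in range(n_resps)]
        # Adversary moves toward low-reward prompts, defender toward high-reward responses.
        adv_logits = [l - lr * g for l, g in zip(adv_logits, prompt_reward)]
        def_logits = [l + lr * g for l, g in zip(def_logits, resp_reward)]
        adv_sum = [s + x for s, x in zip(adv_sum, adv)]
        def_sum = [s + x for s, x in zip(def_sum, dfn)]
    return [s / rounds for s in adv_sum], [s / rounds for s in def_sum]

def duality_gap(reward, adv, dfn):
    """Best defender payoff against adv minus worst-case payoff of dfn; 0 at a Nash equilibrium."""
    n_prompts, n_resps = len(reward), len(reward[0])
    best_defense = max(sum(reward[p][r] * adv[p] for p in range(n_prompts))
                       for r in range(n_resps))
    worst_attack = min(sum(reward[p][r] * dfn[r] for r in range(n_resps))
                       for p in range(n_prompts))
    return best_defense - worst_attack

adv_mix, def_mix = play_game(REWARD)
print("adversary mix:", adv_mix)
print("defender mix:", def_mix)
print("duality gap:", duality_gap(REWARD, adv_mix, def_mix))
```

For this particular table the equilibrium can be worked out by hand: the adversary plays prompt type 0 with probability 1/3 and the defender plays response 0 with probability 5/12, for a game value of 13/30. The averaged strategies printed above should land close to those values, with a small duality gap that shrinks as `rounds` grows and `lr` decreases.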

Code & Models

Repositories

ruizheng20/gpo
PyTorch · Official

Videos

Toward Optimal LLM Alignments Using Two-Player Games · underline

Taxonomy

Topics: Digital Rights Management and Security