Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu

TL;DR
This paper introduces INPO, an online no-regret learning algorithm that aligns large language models with general human preferences by framing RLHF as a two-player game, without estimating the expected win rate of individual responses.
Contribution
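The paper formulates RLHF under a general preference framework as a two-player game and proposes an online algorithm, INPO, in which the policy plays against itself via no-regret learning to approximate the Nash policy. INPO bypasses the need to estimate the expected win rate of individual responses, which typically incurs high computational or annotation costs.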
Findings
Achieves 42.6% win rate on AlpacaEval 2.0
Attains 37.8% win rate on Arena-Hard
Outperforms existing online RLHF methods
Abstract
Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective…
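The self-play idea in the abstract can be illustrated on a toy problem. The sketch below is not INPO's loss objective (the abstract is truncated before describing it); it is a minimal example of no-regret self-play using the standard multiplicative-weights (Hedge) update on a small, made-up preference matrix. The matrix P, the step size eta, and all function names are assumptions chosen for illustration. The preferences are cyclic (response 0 beats 1, 1 beats 2, 2 beats 0), so no Bradley-Terry reward can represent them. Note that this toy computes each response's expected win rate explicitly, which is exactly the quantity the abstract says INPO avoids estimating.

```python
import math

# Hypothetical preference matrix (an assumption for illustration):
# P[i][j] = probability that response i is preferred over response j.
# P[i][j] + P[j][i] = 1, and the preferences are cyclic.
P = [
    [0.5, 0.9, 0.2],
    [0.1, 0.5, 0.7],
    [0.8, 0.3, 0.5],
]


def win_rates(P, policy):
    """Expected win rate of each response against an opponent drawn from `policy`."""
    return [sum(p_ij * q for p_ij, q in zip(row, policy)) for row in P]


def self_play(P, steps=5000, eta=0.05):
    """Run Hedge in self-play and return the time-averaged policy."""
    n = len(P)
    policy = [1.0 / n] * n  # start from the uniform policy
    avg = [0.0] * n
    for _ in range(steps):
        # Accumulate the running average of the policies played so far.
        avg = [a + p / steps for a, p in zip(avg, policy)]
        # The policy's opponent is its own current iterate.
        gains = win_rates(P, policy)
        # Multiplicative-weights update, then renormalize.
        weights = [p * math.exp(eta * g) for p, g in zip(policy, gains)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return avg


def exploitability(P, policy):
    """How much the best single response beats `policy` above the 0.5 game value."""
    return max(win_rates(P, policy)) - 0.5


avg_policy = self_play(P)
gap = exploitability(P, avg_policy)
print("averaged policy:", [round(p, 3) for p in avg_policy])
print("exploitability:", round(gap, 4))
```

Because the game is symmetric and constant-sum, a policy is a Nash policy exactly when no response wins against it more than half the time, so a small exploitability indicates that the averaged policy is close to Nash. The sketch reports the averaged iterate because the no-regret guarantee for Hedge applies to the average of the policies played, not necessarily to the last one.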
Taxonomy
Topics: Auction Theory and Applications
Methods: Shrink and Fine-Tune
