Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Yuheng Zhang; Dian Yu; Baolin Peng; Linfeng Song; Ye Tian; Mingyue Huo; Nan Jiang; Haitao Mi; Dong Yu

arXiv:2407.00617·cs.LG·March 4, 2025·1 cites

TL;DR

This paper introduces INPO (iterative Nash policy optimization), an online no-regret learning algorithm that aligns large language models with human preferences by framing RLHF as a two-player game, without having to estimate the expected win rate of individual responses.

Contribution

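It proposes a game-theoretic formulation of RLHF under general preferences, rather than the Bradley-Terry reward assumption, together with an online algorithm that avoids estimating per-response expected win rates, a step that typically incurs high computational or annotation costs.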

Findings

01 Achieves a 42.6% length-controlled win rate on AlpacaEval 2.0

02 Attains a 37.8% win rate on Arena-Hard

03 Outperforms existing online RLHF methods

Abstract

Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective…
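To make the self-play idea concrete, here is a minimal toy sketch; it is not the INPO loss from the paper. It assumes a hypothetical three-response preference matrix `P` with cyclic preferences (which a Bradley-Terry reward cannot represent) and swaps in Hedge (multiplicative weights), a standard no-regret algorithm, as the learner. The names `self_play_hedge` and `exploitability` are illustrative. The toy also computes per-response win rates exactly, which is feasible for three responses but is precisely the quantity the abstract says INPO avoids estimating at LLM scale.

```python
import numpy as np

# Hypothetical preference matrix (an assumption for illustration):
# P[i, j] is the probability that response i is preferred over response j,
# so P[i, j] + P[j, i] = 1. Preferences are cyclic: 0 beats 1, 1 beats 2,
# and 2 beats 0, so no scalar reward per response can reproduce them.
P = np.array([
    [0.5, 0.9, 0.2],
    [0.1, 0.5, 0.7],
    [0.8, 0.3, 0.5],
])


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - logits.max()
    weights = np.exp(shifted)
    return weights / weights.sum()


def self_play_hedge(P, steps=5000, eta=0.05):
    """Run Hedge against itself and return the time-averaged policy."""
    n = P.shape[0]
    logits = np.zeros(n)   # start from the uniform policy
    average = np.zeros(n)
    for _ in range(steps):
        policy = softmax(logits)
        average += policy
        # Expected win rate of each response against the current policy.
        win_rates = P @ policy
        # Multiplicative-weights update, carried out in log space.
        logits += eta * win_rates
    return average / steps


def exploitability(P, policy):
    """Best-response win rate against `policy`, minus the game value 0.5."""
    return float((P @ policy).max() - 0.5)


avg_policy = self_play_hedge(P)
print("averaged policy:", np.round(avg_policy, 3))
print("exploitability:", round(exploitability(P, avg_policy), 4))
```

Because the game is symmetric and constant-sum, any policy wins exactly half the time against itself, so the average regret of the self-play learner equals the exploitability of its time-averaged policy. That is why the averaged policy, rather than the last iterate, is returned here as the approximation to the Nash policy.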

Videos

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning (SlidesLive)

Taxonomy

Topics: Auction Theory and Applications

Methods: Shrink and Fine-Tune