Proximal Point Nash Learning from Human Feedback

Daniil Tiapkin; Daniele Calandriello; Denis Belomestny; Eric Moulines; Alexey Naumov; Kashif Rasul; Michal Valko; Pierre Menard

arXiv:2505.19731 · stat.ML · March 24, 2026

Open Access · 1 dataset

TL;DR

This paper studies Nash Learning from Human Feedback (NLHF), which models preferences directly as a game rather than through a reward model. Working under a realistic policy parametrization, it provides theoretical convergence guarantees and applies the approach to language model post-training.

Contribution

It develops a proximal point-based Nash learning algorithm with convergence guarantees and demonstrates its effectiveness in large language model post-training.

Findings

01 Proposed a stabilized, proximal point-based Nash learning algorithm with convergence guarantees.

02 Validated the method's empirical performance in large language model post-training.

03 Identified a possible stability limitation of the self-play policy gradient method (equivalent to Online IPO).

Abstract

Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley–Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a more direct alternative by framing the problem as finding a Nash equilibrium of a game defined by these preferences. While many works study the Nash learning problem directly in the policy space, we instead consider it under a more realistic policy parametrization setting. We first analyze a simple self-play policy gradient method, which is equivalent to Online IPO. We establish high-probability last-iterate convergence guarantees for this method, but our analysis also reveals a possible stability limitation of the underlying dynamics. Motivated by this, we embed the self-play…
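To make the intransitivity point concrete, here is a minimal sketch of a preference game over three responses. The preference matrix `P`, its values, and the helper `best_response_win_rate` are hypothetical illustrations and do not come from the paper. The sketch only checks the Nash condition on a toy example; it does not implement the paper's algorithm.

```python
import numpy as np

# Hypothetical preference matrix over three responses (an assumption for
# illustration): P[i, j] is the probability that response i is preferred
# to response j. The cycle 0 > 1 > 2 > 0 is intransitive, so no
# Bradley-Terry scores can reproduce it.
P = np.array([
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
])


def best_response_win_rate(policy: np.ndarray) -> float:
    """Highest win probability any single response achieves against `policy`.

    The win rate is linear in the opponent's mixture, so checking pure
    responses is enough to find the best mixed response as well.
    """
    return float(np.max(P @ policy))


# A policy is a Nash equilibrium of this symmetric game exactly when no
# opponent beats it with probability above one half.
uniform = np.ones(3) / 3
greedy = np.array([1.0, 0.0, 0.0])

print("uniform policy, best opponent win rate:", best_response_win_rate(uniform))
print("greedy policy, best opponent win rate:", best_response_win_rate(greedy))
```

In this toy game the uniform mixture cannot be beaten with probability above about 0.5, so it satisfies the Nash condition, while the policy that always plays response 0 loses to response 2 about 90% of the time.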

Datasets

misovalko/my-research-papers
dataset · 21 dl

Taxonomy

Topics: Reinforcement Learning in Robotics · Speech and dialogue systems · Topic Modeling