Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie

TL;DR
This paper introduces Direct Nash Optimization (DNO), a scalable algorithm for improving large language models by directly optimizing general preferences, leading to state-of-the-art performance against GPT-4-Turbo.
Contribution
DNO provides a provable, efficient, and stable method for directly optimizing general preferences in LLMs, surpassing traditional reward-based approaches.
Findings
DNO achieves a 33% win-rate against GPT-4-Turbo on AlpacaEval 2.0.
The 7B Orca-2.5 model aligned with DNO outperforms models with far more parameters, such as Mistral Large and the 70B-parameter Self-Rewarding LM.
DNO demonstrates monotonic improvement over iterations.
Abstract
This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of "point-wise" rewards (such as the Bradley-Terry model), which fail to express complex intransitive or cyclic preference relations. While advances on RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward maximization framework. Recently, a new wave of research sidesteps the reward maximization presumptions in favor of directly optimizing over "pair-wise" or general…
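The abstract's claim that point-wise rewards cannot express cyclic preferences can be checked on a toy example. The sketch below is an illustration only, not the DNO algorithm: the three-response preference table `PREF` and the helper names `bt_cycle_gap` and `win_rates_against` are hypothetical. It relies on the standard Bradley-Terry form P(i beats j) = sigmoid(r_i - r_j), under which the preference logits around any cycle must sum to zero. A rock-paper-scissors style table violates that, while a general-preference view still has a sensible solution: the uniform mixture, which no single response beats more than half the time.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def bt_cycle_gap(p_ab, p_bc, p_ca):
    """Sum of preference logits around the cycle a -> b -> c -> a.

    Under Bradley-Terry, logit P(i beats j) = r_i - r_j, so the sum
    (r_a - r_b) + (r_b - r_c) + (r_c - r_a) is exactly zero. A nonzero
    value means no scalar reward can reproduce the preferences.
    """
    return logit(p_ab) + logit(p_bc) + logit(p_ca)


def win_rates_against(policy, pref):
    """Win probability of each single response against a mixed policy."""
    n = len(policy)
    return [sum(pref[i][j] * policy[j] for j in range(n)) for i in range(n)]


# Hypothetical cyclic preferences: a beats b, b beats c, c beats a (each 90%).
# PREF[i][j] is the probability that response i is preferred over response j.
PREF = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

# Preferences generated from rewards (2, 1, 0) are consistent with Bradley-Terry.
transitive_gap = bt_cycle_gap(sigmoid(2 - 1), sigmoid(1 - 0), sigmoid(0 - 2))

# The cyclic table is not: its gap is 3 * logit(0.9), far from zero.
cyclic_gap = bt_cycle_gap(PREF[0][1], PREF[1][2], PREF[2][0])

# Against the uniform mixture, every response wins exactly half the time,
# so the mixture is an equilibrium of this symmetric preference game.
uniform = [1.0 / 3.0] * 3
rates = win_rates_against(uniform, PREF)

print(f"transitive gap: {transitive_gap:.6f}")
print(f"cyclic gap:     {cyclic_gap:.6f}")
print("win rates vs uniform:", [round(r, 6) for r in rates])
```

The design point is that the equilibrium check uses only pairwise preference probabilities and never a per-response score, which is the sense in which "pair-wise" or general preferences are more expressive than a point-wise reward.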
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques
Methods: Softmax · Linear Layer · Dense Connections · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Layer Normalization · Multi-Head Attention · Adam
