Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Corby Rosset; Ching-An Cheng; Arindam Mitra; Michael Santacroce; Ahmed Awadallah; Tengyang Xie

arXiv:2404.03715·cs.LG·April 8, 2024·2 cites

TL;DR

This paper introduces Direct Nash Optimization (DNO), a scalable algorithm that improves large language models by directly optimizing general preferences rather than a scalar reward; the resulting 7B model reaches a 33% win-rate against GPT-4-Turbo on AlpacaEval 2.0.

Contribution

DNO provides a provable, efficient, and stable method for directly optimizing general preferences in LLMs, surpassing traditional reward-based approaches.

Findings

01

DNO achieves a 33% win-rate against GPT-4-Turbo on AlpacaEval 2.0.

02

The 7B Orca-2.5 model aligned with DNO outperforms models with far more parameters, including Mistral Large and Self-Rewarding LM (70B parameters).

03

DNO demonstrates monotonic improvement over iterations.

Abstract

This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of "point-wise" rewards (such as the Bradley-Terry model), which fail to express complex intransitive or cyclic preference relations. While advances in RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward maximization framework. Recently, a new wave of research sidesteps the reward maximization presumptions in favor of directly optimizing over "pair-wise" or general…
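The abstract's point about "point-wise" rewards can be illustrated with a toy example. The sketch below is not from the paper: the preference tables, the response names `A`, `B`, `C`, and the helper functions are all hypothetical. It relies on one fact about the Bradley-Terry model: response a is preferred over b with probability above 0.5 exactly when its reward is higher, so any Bradley-Terry preference must agree with some strict ranking of the responses. A cyclic preference admits no such ranking, whereas a mixed policy can still tie against every response under the pair-wise view.

```python
import itertools

# Hypothetical preference tables (illustrative numbers, not from the paper):
# table[(a, b)] is the probability that response a is preferred over b.
CYCLIC = {("A", "B"): 0.9, ("B", "C"): 0.9, ("C", "A"): 0.9}
TRANSITIVE = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.9}
ITEMS = ["A", "B", "C"]


def prefers(table, a, b):
    """Probability that a is preferred over b; 0.5 for identical responses."""
    if a == b:
        return 0.5
    if (a, b) in table:
        return table[(a, b)]
    return 1.0 - table[(b, a)]


def has_consistent_ranking(table, items):
    """True if some strict ordering agrees with every pairwise majority.

    A scalar reward induces such an ordering, so a False result means no
    point-wise reward can reproduce the table's majority preferences.
    """
    for order in itertools.permutations(items):
        pairs = itertools.combinations(order, 2)  # earlier item listed first
        if all(prefers(table, a, b) > 0.5 for a, b in pairs):
            return True
    return False


def win_rate(table, policy, opponent):
    """Expected preference of `policy` over `opponent` (dicts of probabilities)."""
    return sum(
        policy[a] * opponent[b] * prefers(table, a, b)
        for a in policy
        for b in opponent
    )


uniform = {x: 1.0 / len(ITEMS) for x in ITEMS}

if __name__ == "__main__":
    print("transitive table has a ranking:", has_consistent_ranking(TRANSITIVE, ITEMS))
    print("cyclic table has a ranking:", has_consistent_ranking(CYCLIC, ITEMS))
    for x in ITEMS:
        print("uniform vs", x, "win rate:", win_rate(CYCLIC, uniform, {x: 1.0}))
```

In the cyclic table every single response loses to some other response, yet the uniform mixture wins half the time against each of them. That is the kind of solution a game-theoretic (Nash) formulation over general preferences can express and a single scalar reward cannot.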

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques

Methods: Softmax · Linear Layer · Dense Connections · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Layer Normalization · Multi-Head Attention · Adam