D2PO: Discriminator-Guided DPO with Response Evaluation Models

Prasann Singhal; Nathan Lambert; Scott Niekum; Tanya Goyal; Greg Durrett

arXiv:2405.01511 · cs.CL · August 8, 2024

TL;DR

D2PO is a discriminator-guided variant of DPO for online preference collection: gold preferences train both the policy and a response evaluation model, which then silver-labels additional synthetic data to improve response quality and data efficiency.

Contribution

The paper proposes D2PO, a novel method that integrates a discriminator for response evaluation into DPO, enhancing response quality and training efficiency.

Findings

01. D2PO outperforms DPO with the same data budget.

02. D2PO achieves higher response quality in diverse tasks.

03. Silver labeling is most effective when training with DPO and using a separate discriminator.

Abstract

Varied approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but also to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. We explore this approach across a set of diverse tasks, including a realistic chat setting, and find that our approach leads to higher-quality outputs compared to…
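The silver-labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `dpo_loss` and `silver_label`, the `sample` and `score` callables, and the choice of two samples per prompt are all assumptions made for illustration. The loss shown is the standard per-pair DPO objective; how D2PO actually samples and filters silver pairs is not stated in the text above.

```python
import math
from typing import Callable, List, Tuple

# (prompt, preferred response, dispreferred response)
Pair = Tuple[str, str, str]


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard per-pair DPO loss: -log sigmoid(beta * margin).

    The margin is the policy-vs-reference log-ratio of the preferred
    response minus that of the dispreferred response.
    """
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def silver_label(prompts: List[str],
                 sample: Callable[[str], str],
                 score: Callable[[str, str], float]) -> List[Pair]:
    """Build silver preference pairs using a discriminator.

    `sample` (hypothetical) draws one response from the current policy;
    `score` (hypothetical) is the response evaluation model trained on
    gold preferences. Ties are skipped because they carry no preference.
    """
    pairs: List[Pair] = []
    for prompt in prompts:
        a, b = sample(prompt), sample(prompt)
        score_a, score_b = score(prompt, a), score(prompt, b)
        if score_a == score_b:
            continue
        pairs.append((prompt, a, b) if score_a > score_b else (prompt, b, a))
    return pairs
```

In an online loop under these assumptions, each round would refresh the discriminator on newly collected gold preferences, call `silver_label` on fresh policy samples, and then minimize `dpo_loss` over the combined gold and silver pairs.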

Code & Models

Repositories

PrasannS/d2po
PyTorch · Official

Taxonomy

Topics: Anomaly Detection Techniques and Applications

Methods: Direct Preference Optimization · Sparse Evolutionary Training · Entropy Regularization · Proximal Policy Optimization