GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets

Oh Joon Kwon; Daiki E. Matsunaga; Kee-Eung Kim

arXiv:2410.15096·cs.AI·October 22, 2024

Open Access

TL;DR

This paper introduces GDPO, a diversity-focused method for aligning language models with human preferences, which improves response diversity while maintaining alignment in dialog and summarization tasks.

Contribution

The paper proposes GFlowNet-DPO (GDPO), a novel approach combining GFlowNets with preference optimization to enhance diversity in language model outputs.

Findings

01. GDPO produces significantly more diverse responses.

02. GDPO maintains alignment with human preferences.

03. GDPO outperforms baseline methods in diversity metrics.

Abstract

A critical component of the current generation of language models is preference alignment, which aims to precisely control the model's behavior to meet human needs and values. The most notable among such methods are Reinforcement Learning with Human Feedback (RLHF) and its offline variant Direct Preference Optimization (DPO), both of which seek to maximize a reward model based on human preferences. In particular, DPO derives reward signals directly from the offline preference data, but in doing so overfits the reward signals and generates suboptimal responses that may contain human biases in the dataset. In this work, we propose a practical application of a diversity-seeking RL algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting to curtail such challenges. Empirical results show GDPO can generate far more diverse responses than the baseline methods that are…
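For readers unfamiliar with how DPO derives a reward signal from offline preference pairs, the sketch below computes the standard DPO loss for a single (chosen, rejected) pair. It illustrates the baseline the abstract refers to, not the GDPO objective, which this listing does not give. The function name, argument names, and the default `beta=0.1` are assumptions made for illustration. Inputs are assumed to be summed token log-probabilities of each response under the policy and under a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative; not GDPO).

    All four inputs are sequence log-probabilities (floats). `beta` scales the
    implicit reward and is a hypothetical default chosen for this sketch.
    """
    # Implicit reward: beta times the log-ratio of policy to reference.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected

    # Loss is -log(sigmoid(margin)), evaluated in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Example: the policy prefers the chosen response more than the reference does,
# so the loss falls below log(2), the value at zero margin.
example = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Because the loss depends only on log-probability ratios over the fixed dataset, the reward signal is tied entirely to the offline preference pairs, which is the property the abstract identifies as a source of overfitting.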

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Direct Preference Optimization