2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision

Shilong Li; Yancheng He; Hui Huang; Xingyuan Bu; Jiaheng Liu; Hangyu Guo; Weixun Wang; Jihao Gu; Wenbo Su; Bo Zheng

arXiv:2410.19720 · cs.CL · October 28, 2024

TL;DR

This paper introduces 2D-DPO, a framework that extends preference optimization for LLMs from a single scalar signal to two dimensions, segments and aspects. It is trained on a new dataset, HelpSteer-2D, and the authors report improved alignment with human preferences.

Contribution

The work proposes a 2D preference optimization framework and a new dataset, enabling multi-dimensional feedback to better align LLMs with human preferences.

Findings

01. 2D-DPO outperforms scalar and 1D preference methods on benchmarks.

02. The dataset HelpSteer-2D captures multi-dimensional human preferences.

03. Multi-dimensional feedback improves LLM alignment quality.

Abstract

Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments…
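
The abstract describes scoring each sentence-level segment on several aspects and decomposing the DPO objective accordingly. The sketch below illustrates that general idea only; it is not the paper's objective. It assumes (hypothetically) that per-aspect scores are collapsed into one weight per segment by a weighted average, and that these weights scale each segment's policy-to-reference log-ratio inside a standard DPO-style logistic loss. The function names, the aspect names, and the value of `beta` are all illustrative assumptions.

```python
import math


def aggregate_aspects(aspect_scores, aspect_weights):
    """Collapse one segment's per-aspect scores into a single scalar.

    Assumption: a weighted average. The aggregation rule is illustrative.
    """
    total = sum(aspect_weights[a] for a in aspect_scores)
    return sum(aspect_weights[a] * s for a, s in aspect_scores.items()) / total


def segment_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss with a per-segment weight on each log-ratio.

    chosen / rejected: lists of (policy_logp, ref_logp, weight), one tuple
    per segment, where the log-probs are summed over the segment's tokens.
    With every weight equal to 1 this reduces to the standard DPO loss.
    """
    def weighted_margin(segments):
        return sum(w * (lp - ref_lp) for lp, ref_lp, w in segments)

    z = beta * (weighted_margin(chosen) - weighted_margin(rejected))
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical aspect names and scores for a single segment.
weight = aggregate_aspects(
    {"helpfulness": 4, "correctness": 2},
    {"helpfulness": 1.0, "correctness": 1.0},
)
loss = segment_weighted_dpo_loss(
    chosen=[(-1.0, -2.0, weight)],
    rejected=[(-2.0, -1.0, 1.0)],
)
```

Setting every weight to 1 recovers the usual sequence-level DPO loss, so the segment weights are the only place where the finer-grained supervision enters this sketch.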

Taxonomy

Topics: Data Management and Algorithms

Methods: Direct Preference Optimization