Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization
Yuxin Jiang, Bo Huang, Yufei Wang, Xingshan Zeng, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang

TL;DR
This paper introduces BMC, a framework that enhances pairwise preference data modeling for better alignment of large language models by synthesizing responses and learning token-level correlations, outperforming existing methods.
Contribution
The paper proposes a novel BMC framework that improves preference signal quality and models token-level correlations, leading to superior alignment performance over standard DPO.
Findings
BMC significantly outperforms DPO on QA, math, and instruction-following tasks.
Synthesizing pseudo-winning responses enhances preference signal consistency.
Modeling token-level correlations improves nuanced preference understanding.
Abstract
Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the generation of the winning response and the losing response within pairwise data are typically isolated, leading to weak correlations between them as well as suboptimal alignment performance. To address this issue, we propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. Firstly, we increase the consistency and informativeness of the pairwise preference signals through targeted modifications, synthesizing a pseudo-winning response by improving the losing response with the winning response as a reference. Secondly, we identify that DPO alone is insufficient to model these correlations and capture nuanced variations. Therefore, we…
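For readers unfamiliar with the baseline, the standard DPO objective that the abstract refers to can be sketched as follows. This is a minimal illustration of vanilla DPO for a single preference pair, not the BMC objective (the abstract is truncated before BMC's loss is described). The function name, argument names, and the default `beta=0.1` are assumptions made for this example; the inputs are taken to be summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winning, losing) response pair.

    Each argument is the summed log-probability of a full response
    under either the trainable policy or the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the winning response over the losing one.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))

    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Raising the winning response's likelihood relative to the losing one lowers the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Because this loss only sees one scalar log-probability per response, it treats the two responses as independent sequences, which is the limitation the abstract points to when it says DPO alone does not model correlations between them.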
Taxonomy
Topics: Multi-Criteria Decision Making
Methods: Direct Preference Optimization · ALIGN
