DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Yulan Yuan, Zhijiang Guo, Wei Wang

TL;DR
DGPO introduces a groupwise preference optimization framework that enhances directional consistency and reasoning diversity in LLMs, leading to improved performance across benchmarks.
Contribution
The paper proposes a groupwise optimization method that captures richer relative information than pairwise objectives and improves alignment in LLM preference optimization.
Findings
The constructed reverse data yields a 3.2% average improvement across five benchmarks.
DGPO achieves accuracy gains of up to 3.6% across datasets and models.
The groupwise formulation reinforces consistency across diverse reasoning pathways.
Abstract
Although Large Language Models (LLMs) have made remarkable progress, current preference optimization methods still struggle to align directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement…
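The abstract does not spell out the objective, so the following is only a minimal sketch of what a margin-based groupwise comparison could look like, not the authors' formulation. It assumes each group holds sequence log-likelihoods for candidates already labeled as coherent or inconsistent, and it applies a hinge penalty to every coherent/inconsistent pair in the group. The function name `groupwise_margin_loss`, the `margin` parameter, and the example values are all hypothetical.

```python
from itertools import product


def groupwise_margin_loss(coherent_logps, inconsistent_logps, margin=1.0):
    """Mean hinge penalty over every (coherent, inconsistent) pair in one group.

    Illustrative assumption only: the paper's actual objective is not given
    in the abstract. Inputs are per-candidate sequence log-likelihoods.
    """
    if not coherent_logps or not inconsistent_logps:
        raise ValueError("each group needs at least one candidate of each kind")
    penalties = [
        # Penalize a pair when the coherent candidate does not beat the
        # inconsistent one by at least `margin` in log-likelihood.
        max(0.0, margin - (good - bad))
        for good, bad in product(coherent_logps, inconsistent_logps)
    ]
    return sum(penalties) / len(penalties)


# Hypothetical group mixing forward and reverse question-answer candidates.
well_separated = groupwise_margin_loss([-1.0, -1.5], [-4.0, -5.0])
violated = groupwise_margin_loss([-2.0], [-1.5])
```

With one coherent and one inconsistent candidate this reduces to an ordinary pairwise hinge; with larger groups, each candidate is compared against all candidates of the opposite kind, which is one way a group-level signal can carry more relative information than a single pair.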
