DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

Shuaijie She; Yu Bao; Yu Lu; Lu Xu; Tao Li; Wenhao Zhu; Shujian Huang; Shanbo Cheng; Lu Lu; Yuxuan Wang

arXiv:2508.14460·cs.LG·August 21, 2025

TL;DR

DuPO introduces a dual learning framework that enables reliable self-verification of large language models without requiring costly annotations, improving performance across translation, reasoning, and reranking tasks.

Contribution

The paper proposes a dual preference optimization method that extends dual learning to non-invertible tasks and produces annotation-free feedback for LLM training.

Findings

01. Improves translation quality by 2.13 COMET points across 756 directions.

02. Increases mathematical reasoning accuracy by 6.4 points on benchmarks.

03. Enhances inference-time reranking performance by 9.3 points.

Abstract

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single…
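The known/unknown decomposition described in the abstract can be illustrated with a toy arithmetic task. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Problem`, `primal_candidates`, `dual_reconstruct`, `duality_reward`, and `preference_pair` are hypothetical names, plain Python functions stand in for the LLM that would instantiate both tasks, and the binary reward and best-versus-worst pairing are illustrative choices that this excerpt does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Problem:
    """Primal input split into a known and an unknown (hidden) component."""
    known: int    # visible to both the primal and the dual task
    unknown: int  # part of the input that the dual task must reconstruct


def primal_candidates(p: Problem) -> List[int]:
    """Hypothetical stand-in for sampling primal outputs from a model.

    The primal task here is "compute known + unknown"; one candidate is
    correct and two are deliberately wrong.
    """
    true_sum = p.known + p.unknown
    return [true_sum, true_sum + 1, true_sum - 2]


def dual_reconstruct(known: int, primal_output: int) -> int:
    """Hypothetical dual task: recover the unknown input component from
    the primal output and the known component."""
    return primal_output - known


def duality_reward(p: Problem, primal_output: int) -> float:
    """Self-supervised reward: 1.0 if the dual task reconstructs the
    hidden component exactly, else 0.0 (an assumed binary scoring rule).

    No reference answer for the primal task is used, only the input."""
    recovered = dual_reconstruct(p.known, primal_output)
    return 1.0 if recovered == p.unknown else 0.0


def preference_pair(p: Problem) -> Tuple[int, int]:
    """Return (chosen, rejected) primal outputs ranked by duality reward."""
    candidates = primal_candidates(p)
    chosen = max(candidates, key=lambda y: duality_reward(p, y))
    rejected = min(candidates, key=lambda y: duality_reward(p, y))
    return chosen, rejected


if __name__ == "__main__":
    problem = Problem(known=3, unknown=4)
    print(preference_pair(problem))
```

The point of the sketch is that the reward is computed from the input alone: a wrong sum makes the dual task recover the wrong hidden value, so it scores lower without any labeled answer. The resulting pairs could then feed a preference optimization step, which is omitted here.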

Taxonomy

Topics: Digital Rights Management and Security