Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

Xinyu Qiu; Heng Jia; Zhengwen Zeng; Shuheng Shen; Changhua Meng; Yi Yang; Linchao Zhu

arXiv:2601.01483·cs.CV·January 6, 2026

TL;DR

This paper introduces ADPO, a unified reinforcement learning framework that jointly optimizes answer generation and self-verification in a single vision-language model, avoiding the training and inference cost of separate generation and verification models while improving verification accuracy.

Contribution

It proposes a novel decoupled optimization mechanism and preference verification reward for joint learning of generation and verification within a single policy.

Findings

1. Up to 34.1% higher verification AUC

2. 53.5% lower inference time

3. Significant accuracy and success rate improvements on multiple benchmarks

Abstract

Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate…
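
The two mechanisms named in the abstract can be illustrated with a small sketch. This is one possible reading of the description, not the authors' implementation: it assumes each rollout in a group carries a scalar verification score and a binary answer-correctness label, that a correct answer is rewarded when its score exceeds the mean score of incorrect answers (and vice versa), and that advantages are group-normalized in a GRPO-like way. All function names, the 0/1 reward values, and the normalization are hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def preference_verification_rewards(scores, correct):
    """Reward verification scores that agree with answer correctness.

    Assumed reading: the mean score of incorrect samples is the threshold
    for correct samples, and the mean score of correct samples is the
    threshold for incorrect samples.
    """
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        # Assumption: no preference signal when the group has only one class.
        return [0.0] * len(scores)
    pos_mean, neg_mean = mean(pos), mean(neg)
    rewards = []
    for s, c in zip(scores, correct):
        if c:
            rewards.append(1.0 if s > neg_mean else 0.0)
        else:
            rewards.append(1.0 if s < pos_mean else 0.0)
    return rewards


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (an assumed, GRPO-like normalization)."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def decoupled_token_advantages(gen_adv, ver_adv, ver_mask):
    """Assign each token the advantage of the segment it belongs to.

    ver_mask[i] is 1 for verification tokens and 0 for generation tokens,
    so the two objectives never share a token-level advantage.
    """
    return [ver_adv if m else gen_adv for m in ver_mask]


# Four rollouts: two correct answers, two incorrect ones.
scores = [0.9, 0.2, 0.3, 0.95]
correct = [True, True, False, False]
ver_rewards = preference_verification_rewards(scores, correct)
ver_advs = group_advantages(ver_rewards)
gen_advs = group_advantages([1.0 if c else 0.0 for c in correct])

# Token-level advantages for the first rollout: 3 answer tokens, 2 verification tokens.
tokens_adv = decoupled_token_advantages(gen_advs[0], ver_advs[0], [0, 0, 0, 1, 1])
```

In this toy group, the second rollout is correct but scored low and the fourth is incorrect but scored high, so both receive zero verification reward under the assumed rule. Keeping the generation and verification advantages separate means a well-verified wrong answer can still be pushed down on its answer tokens while its verification tokens are treated independently.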

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Adversarial Robustness in Machine Learning