Visual Preference Optimization with Rubric Rewards
Ya-Qi Yu, Fangyu Hong, Xiangyang Qu, Hao Wang, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu

TL;DR
This paper introduces rDPO, a preference optimization framework that scores responses against instance-specific rubrics to improve fine-grained visual reasoning in multimodal tasks.
Contribution
The paper presents a rubric-based preference optimization method that scores responses against checklist-style criteria and outperforms outcome-based filtering on visual tasks.
Findings
Rubric-based prompting improves a 30B-A3B judge on public reward modeling benchmarks, bringing it close to GPT-5.4.
Rubric-based filtering raises the macro average on public downstream benchmarks to 82.69, outperforming outcome-based filtering.
rDPO outperforms both the style-constrained baseline and the base model on comprehensive benchmarks.
Abstract
The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria that can score responses from any policy. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering…
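The rubric idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the `Rubric` class, the 2:1 weighting of essential over additional criteria, the `min_gap` threshold, and the toy `keyword_judge` (which stands in for an LLM judge) are all hypothetical. The sketch scores sampled responses against one rubric and keeps the best and worst as a (chosen, rejected) pair only when their scores differ enough.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rubric:
    """Hypothetical checklist for one image-instruction pair."""
    essential: List[str]   # criteria a good response must satisfy
    additional: List[str]  # criteria that add quality but are optional


# A judge decides whether a response satisfies one criterion.
# In practice this would be a model call; here it is a stand-in.
Judge = Callable[[str, str], bool]


def keyword_judge(response: str, criterion: str) -> bool:
    """Toy judge (assumption): a criterion is met if its text appears in the response."""
    return criterion.lower() in response.lower()


def rubric_score(response: str, rubric: Rubric, judge: Judge,
                 essential_weight: float = 2.0) -> float:
    """Weighted fraction of satisfied criteria, in [0, 1].

    The weighting scheme is an illustrative assumption.
    """
    total = essential_weight * len(rubric.essential) + len(rubric.additional)
    if total == 0:
        return 0.0
    earned = essential_weight * sum(judge(response, c) for c in rubric.essential)
    earned += sum(judge(response, c) for c in rubric.additional)
    return earned / total


def build_preference_pair(responses: List[str], rubric: Rubric, judge: Judge,
                          min_gap: float = 0.2) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected), or None if the score gap is too small."""
    if len(responses) < 2:
        return None
    scored = [(rubric_score(r, rubric, judge), r) for r in responses]
    best = max(scored, key=lambda s: s[0])
    worst = min(scored, key=lambda s: s[0])
    if best[0] - worst[0] < min_gap:
        return None  # filter out pairs with no clear quality difference
    return best[1], worst[1]
```

Because the rubric is fixed per instruction, the same checklist can be reused to score fresh samples from whichever policy is being trained, which is the property the abstract relies on when it describes building the instruction-rubric pool offline.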
