Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

Qiyuan Chen; Hongsen Huang; Jiahe Chen; Qian Shao; Jintai Chen; Hongxia Xu; Renjie Hua; Chuan Ren; Jian Wu

arXiv:2604.05445 · cs.CL · April 8, 2026

TL;DR

This paper introduces VL-MDR, a dynamic, interpretable reward model for vision-language tasks that decomposes evaluation into relevant dimensions, improving transparency and performance.

Contribution

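The paper proposes a multi-dimensional reward framework built around a visual-aware gating mechanism, together with a curated dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions, with the aim of improving interpretability and alignment in vision-language models.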

Findings

01. VL-MDR outperforms existing open-source reward models on VL-RewardBench.

02. Preference pairs constructed with VL-MDR enable DPO alignment that reduces hallucinations.

03. The framework improves scalability and reliability in VLM alignment.
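
Finding 02 can be illustrated with a minimal sketch of how a reward model might be used to build DPO preference pairs. This is not the authors' pipeline: the function name `build_preference_pair`, the `score_fn` callback, and the `min_margin` threshold are all hypothetical, and the rule used here (best-scored response as "chosen", worst-scored as "rejected") is only one common way to do it.

```python
from typing import Callable, Optional, Sequence


def build_preference_pair(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    min_margin: float = 0.1,
) -> Optional[dict]:
    """Turn scored candidate responses into one (chosen, rejected) pair.

    Assumption: score_fn(prompt, response) returns a scalar reward, with
    higher meaning better. In a vision-language setting the prompt would
    also carry the image; a plain string stands in for it here.
    """
    if len(candidates) < 2:
        return None  # a pair needs at least two responses

    # Score every candidate once, then sort from worst to best.
    scored = sorted(((score_fn(prompt, c), c) for c in candidates),
                    key=lambda item: item[0])
    (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]

    # Skip near-ties: a tiny reward gap gives a noisy preference label.
    if high_score - low_score < min_margin:
        return None

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Toy usage with a hard-coded scorer standing in for a reward model.
toy_scores = {"a red car": 0.9, "a blue car": 0.2, "a red truck": 0.5}
pair = build_preference_pair(
    "Describe the image.",
    list(toy_scores),
    lambda _prompt, response: toy_scores[response],
)
```

The margin filter is a design choice rather than something stated in the source: it trades dataset size for cleaner labels.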

Abstract

Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify the relevant dimensions (e.g., Hallucination, Reasoning) and adaptively weight them for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO…
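
The select-then-weight idea described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's architecture: the abstract does not say how dimensions are selected or weighted, so top-k selection and a softmax over the selected gate logits are choices made here for concreteness. The function `aggregate_reward`, the `top_k` parameter, and the "Fluency" dimension are hypothetical; only "Hallucination" and "Reasoning" are named in the source.

```python
import math
from typing import Dict, Tuple


def aggregate_reward(
    dim_scores: Dict[str, float],
    gate_logits: Dict[str, float],
    top_k: int = 3,
) -> Tuple[float, Dict[str, float]]:
    """Combine per-dimension scores into one reward plus its weights.

    dim_scores:  score of the response on each dimension.
    gate_logits: input-dependent relevance of each dimension (in a real
                 model these would come from a gating network that sees
                 the image and text; here they are given directly).
    """
    if set(dim_scores) != set(gate_logits):
        raise ValueError("dim_scores and gate_logits must share keys")

    # Step 1 (selection): keep the k dimensions with the largest logits.
    k = min(top_k, len(gate_logits))
    selected = sorted(gate_logits, key=gate_logits.get, reverse=True)[:k]

    # Step 2 (weighting): softmax over the selected logits only,
    # shifted by the max for numerical stability.
    peak = max(gate_logits[d] for d in selected)
    exps = {d: math.exp(gate_logits[d] - peak) for d in selected}
    total = sum(exps.values())
    weights = {d: e / total for d, e in exps.items()}

    # Step 3 (aggregation): weighted sum of the selected scores.
    reward = sum(weights[d] * dim_scores[d] for d in selected)
    return reward, weights


# Toy usage: two dimensions are equally relevant, one is gated out.
reward, weights = aggregate_reward(
    {"Hallucination": 0.2, "Reasoning": 0.9, "Fluency": 0.7},
    {"Hallucination": 2.0, "Reasoning": 2.0, "Fluency": -1.0},
    top_k=2,
)
```

Returning the weights alongside the scalar is what makes the score inspectable: a reader can see which dimensions contributed to a given reward and by how much.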
