VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Lei Li; Yuancheng Wei; Zhihui Xie; Xuqing Yang; Yifan Song; Peiyi Wang; Chenxin An; Tianyu Liu; Sujian Li; Bill Yuchen Lin; Lingpeng Kong; Qi Liu

arXiv:2411.17451 · cs.CV · June 3, 2025 · 2 citations

TL;DR

VL-RewardBench is a benchmark of 1,250 human-verified examples for evaluating vision-language generative reward models (VL-GenRMs). It spans general multimodal queries, visual hallucination detection, and complex reasoning tasks, and it exposes limitations in current models that can guide future improvements.

Contribution

The paper introduces VL-RewardBench, a benchmark of curated, high-quality examples built to challenge VL-GenRMs. It targets two weaknesses of existing evaluation methods: reliance on AI-annotated preference labels, which can introduce biases, and tasks that fail to challenge state-of-the-art models.

Findings

01. VL-RewardBench effectively challenges current VL-GenRMs, with GPT-4o achieving only 65.4% accuracy.

02. State-of-the-art open-source models struggle to outperform random guessing on the benchmark.

03. Performance correlates strongly with MMMU-Pro accuracy, validating the benchmark's effectiveness.
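
The findings above rest on two quantities: a judge model's accuracy on preference examples, and the correlation between that accuracy and another benchmark's score across models. The sketch below shows one way such numbers could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: it assumes each example offers exactly two candidate responses (so random guessing gives 50%), and the function names and toy labels are hypothetical.

```python
import math


def preference_accuracy(predicted, preferred):
    """Fraction of examples where the judge's pick matches the preferred response.

    Assumption: each example has two candidates labelled "A" and "B".
    """
    if not preferred or len(predicted) != len(preferred):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == g for p, g in zip(predicted, preferred))
    return matches / len(preferred)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy labels for illustration only; these are not benchmark data.
judge_picks = ["A", "B", "A", "A"]
human_preferred = ["A", "B", "B", "A"]

RANDOM_BASELINE = 0.5  # assumption: two-way choice per example
accuracy = preference_accuracy(judge_picks, human_preferred)
print(f"accuracy={accuracy:.2f}, margin over random={accuracy - RANDOM_BASELINE:+.2f}")
```

To reproduce a correlation like the one in finding 03, `pearson` would take one accuracy per model on each benchmark, in the same model order.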

Abstract

Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline that combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe the limitations of VL-GenRMs. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even…

Code & Models

Models

internlm/internlm-xcomposer2d5-7b-reward (Hugging Face model · 225 downloads · 11 likes)

Taxonomy

Topics: Topic Modeling