VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu

TL;DR
VL-RewardBench is a new, challenging benchmark designed to evaluate vision-language generative reward models, revealing their limitations and guiding future improvements through comprehensive, high-quality multimodal evaluation tasks.
Contribution
The paper introduces VL-RewardBench, a novel benchmark with curated high-quality examples to effectively assess and challenge VL-GenRMs, addressing biases and limitations of existing evaluation methods.
Findings
VL-RewardBench effectively challenges current VL-GenRMs, with GPT-4o achieving only 65.4% accuracy.
State-of-the-art open-source models struggle to outperform random guessing on the benchmark.
Performance on VL-RewardBench correlates strongly with MMMU-Pro accuracy, which supports the benchmark's validity.
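As a rough illustration of how findings like these are typically computed, the sketch below scores a judge by pairwise accuracy and then measures a Pearson correlation between two lists of scores. It assumes each benchmark example is a pair of responses with one human-verified preferred index, so that random guessing lands near 50%. The function names and all numbers are hypothetical toy values, not the paper's code or results.

```python
import math

# Assumption: with two candidate responses per example, chance level is 50%.
RANDOM_BASELINE = 0.5


def pairwise_accuracy(judgments, labels):
    """Fraction of pairs where the judge's pick matches the preferred index."""
    if not labels or len(judgments) != len(labels):
        raise ValueError("judgments and labels must be non-empty and equal length")
    return sum(j == l for j, l in zip(judgments, labels)) / len(labels)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical toy data: index (0 or 1) of the preferred response in each pair.
labels = [0, 1, 1, 0, 1, 0, 0, 1]
judgments = [0, 1, 0, 0, 1, 1, 0, 1]
accuracy = pairwise_accuracy(judgments, labels)
margin_over_chance = accuracy - RANDOM_BASELINE

# Hypothetical per-model scores on two benchmarks (made up for illustration).
reward_bench_scores = [0.2, 0.4, 0.6, 0.8]
downstream_scores = [0.1, 0.2, 0.3, 0.4]
r = pearson(reward_bench_scores, downstream_scores)
```

Reporting the margin over the chance baseline, rather than raw accuracy alone, makes it easier to see when a judge is barely better than guessing.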
Abstract
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline that combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe the limitations of VL-GenRMs. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even…
Taxonomy
Topics: Topic Modeling
