Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad

TL;DR
This paper introduces Multimodal RewardBench 2 (MMRB2), a benchmark for evaluating reward models on multimodal understanding and generation tasks involving interleaved text and images, and uses it to measure how well current judges perform and where they fall short.
Contribution
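The paper presents MMRB2, the first comprehensive benchmark for reward models on multimodal understanding and interleaved generation. It covers four tasks (text-to-image, image editing, interleaved generation, and multimodal reasoning) with 1,000 expert-annotated preference pairs each, and analyzes how existing reward models perform on them.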
Findings
Gemini 3 Pro achieves 75-80% accuracy as a judge on MMRB2.
GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy.
The open-source Qwen3-VL-32B performs comparably to Gemini 2.5 Flash.
Abstract
Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained…
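As a rough illustration of what accuracy on a preference-pair benchmark means, the sketch below scores a judge by how often it picks the response that human experts preferred. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the `PreferencePair` fields, the `"A"`/`"B"` label convention, the toy data, and the length-based judge are all hypothetical, and responses are plain strings here even though MMRB2 responses may contain images.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PreferencePair:
    """One benchmark item (hypothetical schema, not MMRB2's actual format)."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B", the human-expert label


# A judge maps (prompt, response_a, response_b) to "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_accuracy(judge: Judge, pairs: Iterable[PreferencePair]) -> float:
    """Fraction of pairs where the judge agrees with the human preference."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        judge(p.prompt, p.response_a, p.response_b) == p.preferred
        for p in pairs
    )
    return correct / len(pairs)


def longer_response_judge(prompt: str, a: str, b: str) -> str:
    """Toy baseline: always prefer the longer response (ties go to A)."""
    return "A" if len(a) >= len(b) else "B"


# Invented toy data, only to exercise the function.
toy_pairs = [
    PreferencePair("draw a cat", "a detailed cat image", "blank", "A"),
    PreferencePair("edit: make sky red", "unchanged", "sky is now red", "B"),
    PreferencePair("explain chart", "long but wrong explanation", "correct", "B"),
    PreferencePair("count the apples", "three", "3 apples, circled", "B"),
]

accuracy = pairwise_accuracy(longer_response_judge, toy_pairs)
print(f"accuracy = {accuracy:.2f}")  # prints "accuracy = 0.75"
```

A judge that picks at random would score about 50% on balanced pairs, which is a useful reference point when reading accuracies in the 66-80% range. A real evaluation would likely also control for position bias, for example by scoring each pair in both orders; whether MMRB2 does so is not stated in this excerpt.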
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Topic Modeling
