Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Yushi Hu; Reyhane Askari-Hemmat; Melissa Hall; Emily Dinan; Luke Zettlemoyer; Marjan Ghazvininejad

arXiv:2512.16899·cs.CL·January 21, 2026

TL;DR

This paper introduces Multimodal RewardBench 2 (MMRB2), a benchmark for evaluating reward models on multimodal understanding and generation tasks involving interleaved text and images. It reports how current judges perform and where they still fall short.

Contribution

The paper presents MMRB2, described as the first comprehensive benchmark for reward models on multimodal understanding and interleaved generation. It covers four tasks with 1,000 expert-annotated preference pairs each and includes an analysis of how existing reward models perform.

Findings

01

Gemini 3 Pro achieves 75-80% accuracy as a judge on MMRB2.

02

GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy.

03

The open-source Qwen3-VL-32B performs comparably to Gemini 2.5 Flash.

Abstract

Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained…
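
The accuracy figures reported for judges can be read as agreement with the human-preferred response on each preference pair. A minimal sketch of that computation follows, assuming a simple pairwise setup; the `PreferencePair` schema, the task labels, the file names, and the `always_a` judge are hypothetical illustrations, not the released data format or the authors' evaluation code.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class PreferencePair(NamedTuple):
    """One human-labelled comparison (hypothetical schema, not the released format)."""
    task: str          # e.g. "text-to-image"
    prompt: str
    response_a: str    # placeholder for an image or interleaved response
    response_b: str
    human_choice: str  # "A" or "B", the response the experts preferred


def judge_accuracy(
    pairs: Iterable[PreferencePair],
    judge: Callable[[PreferencePair], str],
) -> Dict[str, float]:
    """Return, per task, the fraction of pairs where the judge picks the human-preferred response."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair in pairs:
        total[pair.task] += 1
        if judge(pair) == pair.human_choice:
            correct[pair.task] += 1
    return {task: correct[task] / total[task] for task in total}


def always_a(pair: PreferencePair) -> str:
    """Trivial stand-in judge (assumption): always prefers response A."""
    return "A"


# Toy data for illustration only; these are not MMRB2 examples.
pairs = [
    PreferencePair("text-to-image", "a red cube", "img_1.png", "img_2.png", "A"),
    PreferencePair("text-to-image", "a blue sphere", "img_3.png", "img_4.png", "B"),
    PreferencePair("image-editing", "remove the hat", "img_5.png", "img_6.png", "A"),
]

scores = judge_accuracy(pairs, always_a)
print(scores)  # {'text-to-image': 0.5, 'image-editing': 1.0}
```

With two candidate responses per pair, a judge that guesses at random would score about 50% in expectation, which is the natural floor against which the reported accuracies can be compared.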

Code & Models

Models

Bhavkumar21/mmrb2-mj1-checkpoint-results
model

Datasets

rl-research/multimodal-rewardbench-2
dataset · 100 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Topic Modeling