RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs

Meng-Hao Guo; Xuanyu Chu; Qianrui Yang; Zhe-Han Mo; Yiqing Shen; Pei-lin Li; Xinjie Lin; Jinnian Zhang; Xin-Sheng Chen; Yi Zhang; Kiyohiro Nakayama; Zhengyang Geng; Houwen Peng; Han Hu; Shi-Min Hu

arXiv:2505.16770·cs.CV·May 26, 2025

TL;DR

RBench-V is a new benchmark designed to evaluate the multi-modal reasoning abilities of vision-language models through complex questions requiring image manipulation and auxiliary reasoning, revealing current models' significant limitations.

Contribution

This paper introduces RBench-V, the first benchmark focused on assessing models' reasoning through multi-modal outputs, a capability that existing benchmarks, centered on multi-modal inputs and text-only reasoning, neglect.

Findings

01

Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V.

02

Models score far below the human baseline of 82.3%.

03

RBench-V reveals substantial challenges in multi-modal reasoning for state-of-the-art models.

Abstract

The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking processes (also known as multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable reasoning abilities. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike previous…
