Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT
Zhuobai Dong, Junchao Yi, Ziyuan Zheng, Haochen Han, Xiangxi Zheng, Alex Jinpeng Wang, Fangming Liu, Linjie Li

TL;DR
This paper introduces MVPBench, a benchmark that evaluates whether multimodal models can carry out visual physical reasoning along multi-step, evidence-based reasoning paths, and finds that current models fall well short.
Contribution
The paper presents the MVPBench benchmark, a graph-based CoT evaluation framework, and a novel metric for assessing physical reasoning consistency in multimodal models.
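As a rough illustration of what graph-based CoT evaluation can look like, the sketch below merges several annotated reasoning paths into a directed graph and scores a model's step sequence by the fraction of its transitions that follow an annotated edge. This is not the paper's metric: the function names, the step labels, and the scoring rule are all hypothetical, and the sketch assumes each model step has already been mapped to a canonical step label.

```python
from typing import Dict, List, Set


def build_step_graph(paths: List[List[str]]) -> Dict[str, Set[str]]:
    """Merge annotated reasoning paths into a graph: step -> allowed next steps."""
    graph: Dict[str, Set[str]] = {}
    for path in paths:
        for step in path:
            graph.setdefault(step, set())
        for src, dst in zip(path, path[1:]):
            graph[src].add(dst)
    return graph


def path_consistency(predicted: List[str], paths: List[List[str]]) -> float:
    """Hypothetical score: share of predicted transitions that are annotated edges."""
    if len(predicted) < 2:
        # No transitions to check; treat as inconsistent by convention.
        return 0.0
    graph = build_step_graph(paths)
    valid = sum(
        1 for src, dst in zip(predicted, predicted[1:])
        if dst in graph.get(src, set())
    )
    return valid / (len(predicted) - 1)


# Illustrative step labels (assumed, not taken from the benchmark).
annotated_paths = [
    ["locate_bus", "find_front", "infer_direction"],
    ["locate_bus", "read_road_markings", "infer_direction"],
]
score_full = path_consistency(
    ["locate_bus", "find_front", "infer_direction"], annotated_paths
)
score_skip = path_consistency(["locate_bus", "infer_direction"], annotated_paths)
```

Here a prediction that follows either annotated path scores highest, while one that jumps straight to the conclusion is penalized for skipping the evidence step, which is the kind of behavior a final-answer-only metric would miss.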
Findings
Current MLLMs show poor physical reasoning accuracy.
RL-based fine-tuning can harm spatial reasoning.
Models struggle with visual physical reasoning in complex scenes.
Abstract
Understanding the physical world - governed by laws of motion, spatial relations, and causality - poses a fundamental challenge for multimodal large language models (MLLMs). While recent advances such as OpenAI o3 and GPT-4o demonstrate impressive perceptual and reasoning capabilities, our investigation reveals these models struggle profoundly with visual physical reasoning, failing to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. More importantly, they often fail to follow coherent reasoning chains grounded in visual evidence, especially when multiple steps are needed to arrive at the correct answer. To assess this capability, we introduce MVPBench, a curated benchmark designed to rigorously evaluate visual physical reasoning through the lens of visual chain-of-thought (CoT). Each example features interleaved multi-image inputs and…
Peer Reviews
Decision · Submitted to ICLR 2026
- Evaluating the reasoning steps (and not just the final answer) is an important aspect of assessing reasoning capabilities. The authors' proposed method and benchmark fill this important gap.
- Annotating and manually checking the reasoning steps for more than 1k samples in the benchmark is a time-consuming task.
- The experimental results on reinforcement learning-based fine-tuned models are interesting.
Although the benchmark and tasks seem interesting, I am not convinced of the actual quality of the samples, which is the most crucial aspect of a benchmark. For instance, the very first example shown in the paper (Figure 1) seems to be flawed. The correct answer is that the bus is moving downwards (so the front is at the bottom) but the step_1 in Textual CoT says "the tail is near the bottom of the picture". Also, Step 1 in Model reasoning (GPT-4o), correctly says "The front of the vehicle is at
- The introduction of MVPBench, a new benchmark for evaluating visual physical reasoning with multi-image inputs and a graph-based consistency metric, provides a rigorous framework for assessing multimodal models’ reasoning capabilities.
- The paper offers valuable insights into the current weaknesses of state-of-the-art models in visual physical reasoning, particularly in understanding basic physical laws and spatial interactions, and challenges existing practices in reinforcement learning-base
- In Table 3, it appears that closed-source LLMs generally outperform their open-source counterparts. This raises the question of whether the advanced reasoning capabilities of new models are already (or partially) addressing the tasks in question. To investigate, I reviewed the example in Figure 1 for both GPT-5 (thinking) and Gemini-Pro 2.5, and found that both models were able to correctly answer the question. Notably, Gemini-Pro 2.5 even demonstrated the ability to correctly identify the mov
- The paper jointly assesses CoT quality, diversity, and efficiency, enabling a fine-grained measurement of model behavior.
- MVPBench provides multiple annotated reasoning paths for each question. This design better reflects real-world human problem-solving patterns and allows graph-based metrics.
- The benchmark removes textual cues to force the model to use visual evidence for reasoning. This reduces reliance on text priors as a shortcut and isolates the visual reasoning ability.
- Figure 1 incorrectly labels GPT-4o’s Step 1 as wrong, whereas Step 1 of the textual CoT is wrong but is treated as correct.
- The first subsection in the related work, “Limitations of Multi-modal Large Language Models”, is too broad and unfocused. The authors only discuss the limitations of physical understanding.
- The same symbol \alpha is reused in both CRS and Path Coverage Scores without clear differentiation. Moreover, the paper uses “DAG-based matching” (L299–300) before defining it.
Taxonomy
Topics: Data Visualization and Analytics
