VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun

TL;DR
VideoReasonBench is a new benchmark for vision-centric, complex video reasoning. It shows that most multimodal LLMs perform poorly on such tasks and that extended reasoning improves performance significantly.
Contribution
The paper introduces VideoReasonBench, a benchmark for complex, vision-centric video reasoning, and provides a comprehensive evaluation of 18 multimodal LLMs that highlights their limitations and potential improvements.
Findings
Most models perform poorly on complex video reasoning tasks.
Gemini-2.5-Pro achieves 56.0% accuracy, outperforming others.
Extended reasoning budgets are crucial for better performance.
Abstract
Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video…
Peer Reviews
Decision · ICLR 2026 Poster
1. The work effectively identifies a key deficiency in current video benchmarks: weak alignment with reasoning and low demand for exploiting video information. For instance, Table 2 shows that using raw video input underperforms a video-to-text pipeline, and Table 3 indicates that prior benchmarks can be partially solved with text-only inputs.
2. As a benchmark for assessing MLLM reasoning, it demonstrates a positive correlation between accuracy and scaled reasoning length, which aligns with exp
1. The task format is relatively narrow, covering only six highly customized scenarios. If models are exposed to similar data during training, they may quickly overfit or “hack” the benchmark.
2. The substantial gap between open-source and closed-source models in Table 2 warrants analysis. Is this driven by differences in training data coverage or by capability gaps? For example, open-source models may have been trained on fewer or less diverse data and thus not encountered similarly customized
The benchmark covers both synthetic and real-world capture paths, including Matplotlib programmatic renderings, terminal screenshots, and manual videos, with balanced distributions across skills and demos and controlled settings for state size and operation length to adjust difficulty.
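The "state size" and "operation length" difficulty knobs can be illustrated with a toy simulator. This is only a minimal sketch under stated assumptions, not the benchmark's actual generator: the list-valued state, the `swap` and `shift` operations, and the names `apply_op` and `make_episode` are all hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical operation format: (name, arg1, arg2).
Op = Tuple[str, int, int]


def apply_op(state: List[int], op: Op) -> List[int]:
    """Return a new state after applying one operation (state is not mutated)."""
    name, a, b = op
    new = list(state)
    if name == "swap":
        # Exchange the elements at positions a and b.
        new[a], new[b] = new[b], new[a]
    elif name == "shift":
        # Rotate the state to the right by a positions; b is unused.
        k = a % len(new)
        if k:
            new = new[-k:] + new[:-k]
    else:
        raise ValueError(f"unknown operation: {name}")
    return new


def make_episode(state_size: int, num_ops: int, seed: int = 0):
    """Sample an initial state, a random operation sequence, and the final state.

    state_size and num_ops play the role of the two difficulty settings.
    """
    if state_size < 2:
        raise ValueError("state_size must be at least 2")
    rng = random.Random(seed)
    initial = list(range(state_size))
    ops: List[Op] = []
    for _ in range(num_ops):
        if rng.random() < 0.5:
            ops.append(("swap", rng.randrange(state_size), rng.randrange(state_size)))
        else:
            ops.append(("shift", rng.randrange(1, state_size), 0))
    state = initial
    for op in ops:
        state = apply_op(state, op)
    return initial, ops, state
```

In this toy setting, a larger `state_size` means more to track per step and a larger `num_ops` means a longer chain to follow, which is one plausible reading of how such settings would scale difficulty.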
W1: For five of six skills, correctness is determined by a text-only LLM judge given GT and the model output. Even with careful prompts, LLM graders can show bias or instability and may prefer format-matched answers over truly correct content. The paper does not report inter-judge agreement, self-consistency, or adversarial sensitivity such as paraphrase, verbosity, or distractors. This is quite concerning. For Predict Operation, answers are extracted using an LLM, then simulated. This introduce
1. The benchmark formalizes video as “latent state + a sequence of visible operations,” decomposed into three reasoning levels with six concrete skills, which makes failure modes inspectable and comparable.
2. The study evaluates a wide span of open-source MLLMs
1. Since most tasks are scored with a judge LLM, would you consider reporting a brief robustness study (e.g., prompt ablations, temperature sweeps, and an alternative judge model) and an agreement statistic?
2. To make cross-model comparisons easier, could you add some description about budget setting, such as fps/frame count and max tokens?
3. The results indicate accuracy drops as state size and operation count increase, and when the final state is only revealed at the end. What may happen i
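One agreement statistic that could serve the purpose the reviewers raise is Cohen's kappa between the verdicts of two judge models. The sketch below is an illustration only: the paper excerpt does not say which statistic, if any, was used, and the verdict labels and the function name `cohens_kappa` are assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(judge_a: Sequence[str], judge_b: Sequence[str]) -> float:
    """Cohen's kappa for two judges labeling the same items."""
    if len(judge_a) != len(judge_b) or len(judge_a) == 0:
        raise ValueError("both judges must label the same non-empty item set")
    n = len(judge_a)
    # Observed agreement: fraction of items where the verdicts match.
    observed = sum(x == y for x, y in zip(judge_a, judge_b)) / n
    # Chance agreement from each judge's marginal label frequencies.
    count_a, count_b = Counter(judge_a), Counter(judge_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both judges used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts from two judge models on four answers.
a = ["correct", "correct", "wrong", "wrong"]
b = ["correct", "wrong", "wrong", "wrong"]
print(cohens_kappa(a, b))  # 0.5
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a plain match rate when one verdict dominates.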
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare
