Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs

Rui Zhu; Xin Shen; Shuchen Wu; Chenxi Miao; Xin Yu; Yang Li; Weikang Li; Deguo Xia; Jizhou Huang

arXiv:2601.09430 · cs.CV · January 15, 2026

Open Access

TL;DR

This paper introduces Video-MSR, a benchmark for evaluating multi-hop spatial reasoning in dynamic videos. It exposes the limitations of current models and shows that specialized instruction tuning improves their performance.

Contribution

The paper presents the first benchmark for multi-hop spatial reasoning in videos and demonstrates how instruction tuning enhances model performance on this challenging task.

Findings

01

Models perform well on perception but struggle with multi-step spatial reasoning.

02

State-of-the-art models show significant performance drops on MSR tasks.

03

Fine-tuning with MSR-9K improves model accuracy by 7.82%.
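
The findings above are stated in terms of accuracy. As a minimal sketch of how such figures could be tallied, the Python below computes per-task accuracy from graded question-answer records and the difference between two models in percentage points. The record layout, function names, and toy data are assumptions for illustration and are not the paper's evaluation code; only the four task names come from the abstract. This page does not say whether the reported 7.82% is an absolute or a relative gain, so the sketch assumes an absolute difference.

```python
from collections import defaultdict

# Task names are taken from the abstract; everything else is hypothetical.
TASKS = (
    "Constrained Localization",
    "Chain-based Reference Retrieval",
    "Route Planning",
    "Counterfactual Physical Deduction",
)


def per_task_accuracy(records):
    """Return {task: accuracy in percent} for (task, is_correct) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, is_correct in records:
        total[task] += 1
        correct[task] += int(is_correct)
    return {task: 100.0 * correct[task] / total[task] for task in total}


def overall_accuracy(records):
    """Return accuracy in percent over all records, ignoring task labels."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    return 100.0 * sum(int(ok) for _, ok in records) / len(records)


def gain_in_points(base_records, tuned_records):
    """Absolute accuracy difference (tuned minus base) in percentage points."""
    return overall_accuracy(tuned_records) - overall_accuracy(base_records)


# Toy graded answers for a base model and a fine-tuned model (made-up data).
base = [(TASKS[0], True), (TASKS[1], False), (TASKS[2], False), (TASKS[3], True)]
tuned = [(TASKS[0], True), (TASKS[1], True), (TASKS[2], False), (TASKS[3], True)]

print(per_task_accuracy(tuned))
print(gain_in_points(base, tuned))
```

Reporting accuracy per task alongside the overall figure keeps a gain on one task from hiding a regression on another.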

Abstract

Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), drawing increasing attention and rapid advancement. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios requiring complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline combining advanced model generation with rigorous human…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Constraint Satisfaction and Optimization