MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model
Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang

TL;DR
This paper introduces MultihopSpatial, a benchmark for multi-hop spatial reasoning in vision-language models, together with a new evaluation metric (Acc@50IoU) and a training corpus (MultihopSpatial-Train). It reveals the limitations of current models and reports improvements from reinforcement learning.
Contribution
The paper contributes a benchmark, an evaluation metric, and a large-scale training corpus for multi-hop spatial reasoning in vision-language models, targeting the multi-hop compositional reasoning and precise visual grounding that existing single-hop benchmarks neglect.
Findings
Current VLMs struggle with multi-hop spatial reasoning.
Reinforcement learning improves spatial reasoning and manipulation performance.
MultihopSpatial-Train enhances models' spatial intelligence.
Abstract
Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to…
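To make the joint criterion behind Acc@50IoU concrete, here is a minimal sketch of how such a metric could be computed. It is not the authors' implementation: the function names `iou` and `acc_at_50_iou`, the `(x1, y1, x2, y2)` corner box format, and the inclusive 0.5 threshold are all assumptions made for illustration. The only detail taken from the abstract is that a sample counts as correct when both the selected answer and the predicted bounding box are right.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_50_iou(
    pred_answers: Sequence[str],
    gold_answers: Sequence[str],
    pred_boxes: Sequence[Box],
    gold_boxes: Sequence[Box],
    threshold: float = 0.5,  # assumed: "50IoU" read as IoU >= 0.5
) -> float:
    """Fraction of samples with a correct answer AND a box meeting the IoU threshold."""
    if not gold_answers:
        return 0.0
    hits = sum(
        1
        for pa, ga, pb, gb in zip(pred_answers, gold_answers, pred_boxes, gold_boxes)
        if pa == ga and iou(pb, gb) >= threshold
    )
    return hits / len(gold_answers)


# Hypothetical toy data: only the first sample satisfies both conditions.
score = acc_at_50_iou(
    ["A", "B", "C"],
    ["A", "B", "D"],
    [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)],
    [(0, 0, 2, 2), (1, 0, 3, 2), (0, 0, 2, 2)],
)
```

In this toy example the second sample has the right answer but a poorly localized box, and the third has a perfect box but the wrong answer, so neither counts. That is the sense in which a joint metric is stricter than answer accuracy alone.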
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
