3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso M de Melo, Alan Yuille

TL;DR
This paper introduces 3DSRBench, a comprehensive benchmark with 2,772 questions to evaluate and analyze the 3D spatial reasoning capabilities of large multi-modal models across diverse viewpoints and question types.
Contribution
The paper presents the first comprehensive 3D spatial reasoning benchmark, together with a novel evaluation strategy (FlipEval) and a viewpoint analysis, to assess the 3D understanding of LMMs.
Findings
LMMs show limitations in height, orientation, and multi-object reasoning.
Performance degrades on images with uncommon 6D viewpoints.
The benchmark provides insights for improving the 3D spatial reasoning of LMMs.
Abstract
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects in 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their application to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their ability to perform 3D spatial reasoning on diverse natural images is less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct a robust and thorough evaluation of 3D spatial reasoning abilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study…
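The abstract names FlipEval but does not define it. The sketch below is only one plausible reading, under assumptions: each image is mirrored horizontally, "left"/"right" answers are swapped to match, and a question counts as correct only if the model answers both the original and the mirrored version correctly. All names (`flip_image`, `flip_answer`, `flip_example`, `paired_accuracy`), the toy pixel-grid image format, and the paired scoring rule are hypothetical and are not taken from the paper or its code.

```python
# Hypothetical sketch of a flip-based paired evaluation.
# ASSUMPTIONS (not from the source): mirroring is horizontal, only
# "left"/"right" answers change, and credit requires both views correct.

SWAP = {"left": "right", "right": "left"}


def flip_image(pixels):
    """Mirror a row-major pixel grid (list of rows) horizontally."""
    return [list(reversed(row)) for row in pixels]


def flip_answer(answer):
    """Swap left/right; any other answer is unaffected by mirroring."""
    return SWAP.get(answer, answer)


def flip_example(example):
    """Build the mirrored counterpart of one question-answer example.

    The question text is kept unchanged, which is a simplification.
    """
    return {
        "image": flip_image(example["image"]),
        "question": example["question"],
        "answer": flip_answer(example["answer"]),
    }


def paired_accuracy(examples, predict):
    """Fraction of examples answered correctly on both views."""
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        flipped = flip_example(ex)
        if predict(ex) == ex["answer"] and predict(flipped) == flipped["answer"]:
            correct += 1
    return correct / len(examples)


# Toy data: a 1x2 "image" whose single bright pixel marks the object.
examples = [{"image": [[1, 0]], "question": "Which side is the object on?", "answer": "left"}]


def always_left(ex):
    """A biased predictor that ignores the image."""
    return "left"


def looks_at_image(ex):
    """A predictor that actually uses the object's column."""
    row = ex["image"][0]
    return "left" if row.index(1) < len(row) / 2 else "right"
```

Under these assumptions, a predictor with a fixed left/right bias gets no credit on the pair, while one that uses the image content is unaffected by the mirroring.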
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
