3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Wufei Ma; Haoyu Chen; Guofeng Zhang; Yu-Cheng Chou; Jieneng Chen; Celso M de Melo; Alan Yuille

arXiv:2412.07825·cs.CV·September 17, 2025

TL;DR

This paper introduces 3DSRBench, a comprehensive benchmark with 2,772 questions to evaluate and analyze the 3D spatial reasoning capabilities of large multi-modal models across diverse viewpoints and question types.

Contribution

The paper presents the first comprehensive 3D spatial reasoning benchmark, together with a novel evaluation strategy (FlipEval) and a viewpoint analysis, to assess the 3D understanding of large multi-modal models (LMMs).

Findings

01 LMMs show limitations in height, orientation, and multi-object reasoning.

02 Performance degrades on images with uncommon 6D viewpoints.

03 Benchmark provides insights for improving 3D spatial reasoning in models.

Abstract

3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects in 3D space. This ability allows models to develop a comprehensive understanding of the 3D scene, making them applicable to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their ability to perform 3D spatial reasoning on diverse natural images is less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct a robust and thorough evaluation of 3D spatial reasoning abilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study…

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques