Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
Jiangye Yuan, Gowri Kumar, Baoyuan Wang

TL;DR
This paper introduces GR3D, a geometrically referenced 3D scene representation, to enhance multimodal large language models' spatial reasoning capabilities without additional training.
Contribution
The authors propose a zero-shot method based on GR3D that improves MLLMs' 3D spatial reasoning on the VSI-Bench and MindCube benchmarks without any additional training.
Findings
Boosted GPT-5 performance by 9% on VSI-Bench.
Improved accuracy by 12% on MindCube.
Enabled complex spatial reasoning with sparse views.
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach yields substantial improvements on challenging spatial reasoning benchmarks, boosting…
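As an illustration of the idea in the abstract (objects tagged with unique IDs, with their 3D attributes written out as text keyed by those IDs), here is a minimal sketch. It is not the authors' implementation: the `SceneObject` fields, the metre units, the coordinate frame, and the exact text layout are all assumptions made for this example, and the paper may use different attributes and formatting.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    """Hypothetical record for one annotated object (fields are assumptions)."""
    obj_id: int                          # unique ID, also drawn on the image
    label: str                           # object category name
    center: Tuple[float, float, float]   # 3D centre, assumed metres in a shared frame
    size: Tuple[float, float, float]     # assumed bounding-box extents in metres


def to_text_reference(obj: SceneObject) -> str:
    """Render one object's 3D attributes as a line of text indexed by its ID."""
    cx, cy, cz = obj.center
    sx, sy, sz = obj.size
    return (
        f"[{obj.obj_id}] {obj.label}: "
        f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}) m, "
        f"size=({sx:.2f}, {sy:.2f}, {sz:.2f}) m"
    )


def build_references(objects: List[SceneObject]) -> str:
    """Join all per-object lines, sorted by ID, into one block for a text prompt."""
    ordered = sorted(objects, key=lambda o: o.obj_id)
    return "\n".join(to_text_reference(o) for o in ordered)


def center_distance(a: SceneObject, b: SceneObject) -> float:
    """Euclidean distance between two object centres, the kind of quantity
    a model could derive from the textual references."""
    return math.dist(a.center, b.center)


if __name__ == "__main__":
    chair = SceneObject(1, "chair", (0.0, 0.0, 0.0), (0.5, 0.5, 1.0))
    table = SceneObject(2, "table", (3.0, 4.0, 0.0), (1.2, 0.8, 0.75))
    print(build_references([table, chair]))
    print(f"distance between 1 and 2: {center_distance(chair, table):.2f} m")
```

In a setup like this, the text block would be passed to the MLLM together with the images carrying the matching ID marks, so that a question such as "how far is the chair from the table?" reduces to arithmetic over the listed coordinates.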
