Do 3D Large Language Models Really Understand 3D Spatial Relationships?
Xianzheng Ma, Tao Sun, Shuai Chen, Yash Bhalgat, Jindong Gu, Angel X Chang, Iro Armeni, Iro Laina, Songyou Peng, Victor Adrian Prisacariu

TL;DR
This paper critically evaluates the true 3D understanding of large language models, introduces a rigorous benchmark to test spatial reasoning, and proposes training methods to improve genuine 3D comprehension.
Contribution
It reveals that current models may rely on textual shortcuts, introduces the Real-3DQA benchmark for better evaluation, and proposes a reweighted training approach to enhance 3D spatial reasoning.
Findings
Existing 3D-LLMs struggle with spatial reasoning when cues are removed.
A language model fine-tuned only on text question-answer pairs can match or outperform specialized 3D models on the SQA3D benchmark.
The proposed training method improves models' reliance on 3D visual cues.
Abstract
Recent 3D Large Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can match or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect whether a model exploits textual shortcuts rather than engaging in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that guides the model to rely more on 3D visual cues,…
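The filtering idea in the abstract can be illustrated with a minimal sketch. The source only says that Real-3DQA filters out easy-to-guess questions; the exact criterion is not given here. The version below assumes a question counts as guessable when a text-only ("blind") predictor already returns the reference answer under case-insensitive exact match. `QAPair`, `filter_guessable`, and the toy predictor `always_left` are hypothetical names used for illustration, not part of the paper's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str


def filter_guessable(
    pairs: List[QAPair],
    blind_predict: Callable[[str], str],
) -> List[QAPair]:
    """Keep only the pairs that the text-only ("blind") predictor gets wrong.

    Assumption: exact match after lowercasing and stripping whitespace is the
    guessability test; the benchmark's real criterion may differ.
    """
    kept = []
    for pair in pairs:
        guess = blind_predict(pair.question).strip().lower()
        if guess != pair.answer.strip().lower():
            kept.append(pair)
    return kept


# Toy stand-in for a fine-tuned text-only model: it always answers "left".
def always_left(question: str) -> str:
    return "left"


pairs = [
    QAPair("Which side is the chair on?", "left"),
    QAPair("Which side is the lamp on?", "right"),
]
hard_pairs = filter_guessable(pairs, always_left)
```

In this toy run only the lamp question survives, because the blind predictor cannot answer it without looking at the scene.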
Peer Reviews
Decision: ICLR 2026 Poster
- The finding that text-only models perform nearly as well as 3D-LLMs on standard benchmarks is valuable to the community.
- Real-3DQA improves evaluation fairness by filtering 3D-independent samples and introducing rotation consistency, which is sound.
- The authors evaluate multiple existing 3D-LLMs and provide both quantitative and qualitative analyses that convincingly demonstrate the identified problem.
Limited novelty in methodology: The paper diagnoses dataset bias effectively but does not propose new modeling architectures or mechanisms to fundamentally improve 3D reasoning.
1. The Real-3DQA benchmark largely refines existing datasets via filtering and simple viewpoint augmentations; while useful, it feels more like an engineering refinement than a conceptual leap.
2. The 3DR-FT method adds a weighting term to the loss based on text–3D prediction discrepancy; the idea is intuitive but techni
The paper is original in framing “real” 3D understanding via the Real-3DQA pipeline and the rotation-consistency metric (VRS), moving beyond shortcut-prone benchmarks. Methodologically it’s solid: the blind-vs-vision contrast, rotated rephrasings, and multi-stage QC make the evidence credible, and 3DR-FT is a clear, effective objective. The writing and figures communicate the pipeline and metrics cleanly. Empirically, the work exposes meaningful brittleness in 3D-LLMs and shows a practical path
Rotation robustness is evaluated using GPT-generated viewpoint texts. Although quality control is thorough, this approach still relies on linguistic rather than geometric variation, which limits the test’s realism. Using scene-graph or pose-based rotations with automatic answer recomputation would more directly assess spatial consistency. Coverage is limited to four fixed views; expanding to denser yaw and pitch/roll, plus reporting uncertainty, would strengthen the claims. The Real-3DQA set is much sma
- The paper is clearly written and easy to follow, with a well-structured presentation of ideas. The authors’ approach to analyzing the limitations of current 3D-LLMs is insightful and well-motivated, making the paper engaging to read.
- It provides a comprehensive and well-organized overview of existing 3D-LLMs and their evaluation benchmarks.
- The proposed Real-3DQA dataset and VRS metric are meaningful and valuable additions to the 3D-LLM research community. The comparative evaluation of exi
- The proposed 3D-aware Reweighted Fine-Tuning (3DR-FT) method improves performance on the Real-3DQA benchmark but results in degraded performance on the original datasets. This suggests that the approach may either reduce the model’s generalization to language-prior-heavy questions or bias it toward producing answers that deviate from the dominant linguistic mode when 3D cues are absent. However, in real-world scenarios, many questions that 3D-LLMs encounter are likely to rely heavily on lingui
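The reviews describe 3DR-FT as adding a weighting term to the loss based on the discrepancy between text-only and 3D-conditioned predictions, but the exact formula is not reproduced on this page. The sketch below is one possible reading, not the authors' objective: it assumes each sample's negative log-likelihood is scaled by `1 - p_text`, where `p_text` is the probability a text-only model assigns to the correct answer, so that samples answerable from text alone contribute less. The function name `reweighted_nll` and the normalization by total weight are illustrative choices.

```python
import math
from typing import Sequence


def reweighted_nll(
    p_3d: Sequence[float],
    p_text: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Weighted negative log-likelihood over a batch.

    p_3d[i]   : probability the 3D-conditioned model gives the correct answer
    p_text[i] : probability a text-only model gives the correct answer
    Assumed weight: w_i = 1 - p_text[i], so samples that are guessable from
    text alone are down-weighted. This is a hypothetical weighting scheme.
    """
    if len(p_3d) != len(p_text):
        raise ValueError("p_3d and p_text must have the same length")
    weights = [1.0 - pt for pt in p_text]
    losses = [-math.log(max(p, eps)) for p in p_3d]
    total_weight = sum(weights)
    if total_weight <= eps:
        # Every sample is fully guessable from text; nothing to learn from.
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_weight
```

Under this assumed weighting, a mistake on a question the text-only model finds easy costs less than the same mistake on a question that needs the 3D scene, which matches the trade-off the reviewer raises: performance on language-prior-heavy questions may drop.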
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation
