N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, Dong Yu

TL;DR
N3D-VLM is a unified framework that integrates native 3D object perception with 3D-aware visual reasoning, improving both 3D grounding accuracy and spatial understanding in vision-language models.
Contribution
The paper presents a novel approach combining native 3D perception with spatial reasoning, supported by a large-scale 3D dataset generated through depth-based annotation lifting.
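To make "depth-based annotation lifting" concrete, the sketch below back-projects the pixels inside a 2D box into camera space using a depth map and pinhole intrinsics, then takes the extent of the resulting points as a 3D box. It is a minimal illustration under stated assumptions, not the authors' pipeline: the names `backproject` and `lift_box_to_3d`, the axis-aligned box output, the intrinsics `fx, fy, cx, cy`, and the convention that a depth of 0 means "missing" are all hypothetical.

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z into camera coordinates."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def lift_box_to_3d(box, depth, fx, fy, cx, cy):
    """Lift a 2D box to an axis-aligned 3D box in camera coordinates.

    box   -- (u_min, v_min, u_max, v_max) in pixels; the max edges are exclusive
    depth -- list of rows, depth[v][u], in metres; 0 marks missing depth (assumption)
    Returns (center, size), or None when no valid depth falls inside the box.
    """
    u0, v0, u1, v1 = box
    points = [
        backproject(u, v, depth[v][u], fx, fy, cx, cy)
        for v in range(v0, v1)
        for u in range(u0, u1)
        if depth[v][u] > 0
    ]
    if not points:
        return None
    xs, ys, zs = zip(*points)
    # Centre is the midpoint of the point extent along each axis.
    center = tuple((min(a) + max(a)) / 2 for a in (xs, ys, zs))
    size = tuple(max(a) - min(a) for a in (xs, ys, zs))
    return center, size
```

A real pipeline would likely need an object mask rather than a full box, plus outlier rejection for background pixels; both are omitted here to keep the idea visible.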
Findings
Achieves state-of-the-art results on 3D grounding tasks.
Surpasses existing methods in 3D spatial reasoning.
Uses a scalable data pipeline for diverse 3D annotations.
Abstract
While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we…
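The idea of reasoning explicitly on top of localized 3D objects can be illustrated with a small example: once two objects have 3D centres, relations such as distance or "in front of" reduce to simple geometry. The sketch below is a hypothetical illustration, not the model's actual reasoning procedure; the function name `spatial_relation`, the `(center, size)` box format, and the camera convention (x right, y down, z forward, in metres) are assumptions.

```python
import math


def spatial_relation(a, b):
    """Describe where box a sits relative to box b using their 3D centres.

    a, b -- (center, size) pairs in camera coordinates, assumed to be
            x right, y down, z forward, in metres.
    Ties on an axis fall into the second label of each pair.
    """
    center_a, _ = a
    center_b, _ = b
    return {
        "distance_m": math.dist(center_a, center_b),
        "horizontal": "left of" if center_a[0] < center_b[0] else "right of",
        "vertical": "above" if center_a[1] < center_b[1] else "below",
        "depth": "in front of" if center_a[2] < center_b[2] else "behind",
    }


# Usage: a cup 1 m from the camera and a chair 5 m away, 3 m to the right.
cup = ((0.0, 0.0, 1.0), (0.1, 0.1, 0.1))
chair = ((3.0, 0.5, 5.0), (0.5, 1.0, 0.5))
relation = spatial_relation(cup, chair)
```

Because each answer is derived from explicit coordinates, the intermediate quantities can be inspected, which is the sense in which this style of reasoning is more interpretable than predicting an answer directly from pixels.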
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
