S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
Beining Xu, Siting Zhu, Zhao Jin, Junxian Li, Hesheng Wang

TL;DR
S$^2$-MLLM is a framework that enhances the spatial reasoning of multi-modal large language models (MLLMs) for 3D visual grounding through implicit structural guidance and attention mechanisms, improving both efficiency and accuracy.
Contribution
The paper introduces an implicit spatial reasoning framework with a structure-enhanced module that improves 3D scene understanding without relying on point cloud reconstruction.
Findings
Achieves superior performance on ScanRefer, Nr3D, and Sr3D datasets.
Demonstrates improved efficiency over existing methods.
Unifies generalization and accuracy in 3D visual grounding.
Abstract
3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding the 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Motion and Animation
