PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding
Seongmin Jung, Seongho Choi, Gunwoo Jeon, Minsu Cho, Jongwoo Lim

TL;DR
PanoGrounder introduces a panoramic scene representation method that leverages pretrained 2D vision-language models for improved 3D visual grounding, enhancing generalization and reasoning in robotics applications.
Contribution
The paper presents a novel panoramic representation framework that integrates 2D VLMs with 3D scene understanding, enabling better generalization and reasoning in 3D visual grounding tasks.
Findings
Achieves state-of-the-art results on ScanRefer and Nr3D datasets.
Demonstrates superior generalization to unseen 3D datasets.
Effectively handles text rephrasings in 3D grounding.
Abstract
3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Neural Network Applications · Natural Language Processing Techniques
