Z3D: Zero-Shot 3D Visual Grounding from Images
Nikita Drozdov, Andrey Lemeshko, Nikita Gavrilov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi

TL;DR
Z3D is a zero-shot 3D visual grounding method that works from multi-view images. It combines zero-shot 3D instance segmentation for bounding box proposals with VLM reasoning via prompt-based segmentation, and reaches state-of-the-art zero-shot results on ScanRefer and Nr3D without geometric supervision.
Contribution
The paper presents Z3D, a zero-shot 3D visual grounding pipeline that can operate on multi-view images alone, optionally incorporating camera poses and depth maps, and that uses modern VLMs for reasoning.
Findings
Achieves state-of-the-art zero-shot performance on ScanRefer and Nr3D.
Uses a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals.
Uses prompt-based segmentation to draw on the full reasoning capabilities of modern VLMs.
Abstract
3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero-shot 3DVG from multi-view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero-shot methods causing significant performance degradation and address them with (i) a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals and (ii) advanced reasoning via prompt-based segmentation, which utilizes full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state-of-the-art performance among zero-shot methods. Code is…
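To make the role of the optional camera poses and depth maps concrete, the sketch below shows one common way a 2D instance mask can be lifted to an axis-aligned 3D box. It is not the Z3D implementation: the function name `lift_mask_to_box`, the pinhole intrinsics `K`, and the camera-to-world pose convention are all assumptions made for illustration.

```python
import numpy as np

def lift_mask_to_box(mask, depth, K, cam_to_world):
    """Back-project masked pixels into world space and fit an axis-aligned box.

    Hypothetical helper, not taken from the paper.
    mask:         (H, W) bool array, True where the object is visible
    depth:        (H, W) float array, depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics (fx, fy, cx, cy)
    cam_to_world: (4, 4) camera-to-world pose
    Returns (min_xyz, max_xyz), or None if no valid pixel exists.
    """
    # Keep only masked pixels with a valid (positive) depth reading.
    v, u = np.nonzero(mask & (depth > 0))
    if v.size == 0:
        return None
    z = depth[v, u]
    # Invert the pinhole projection: pixel (u, v) at depth z -> camera frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    # Move the points into the world frame with the camera pose.
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world.min(axis=0), pts_world.max(axis=0)


# Toy usage: a 2x2 mask at constant depth, camera shifted 1 unit along x.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
depth = np.full((4, 4), 2.0)
K = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[0, 3] = 1.0
box_min, box_max = lift_mask_to_box(mask, depth, K, pose)
```

In a multi-view setting, boxes or points lifted from several frames would still need to be merged per object; that step is left out here.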
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Pose and Action Recognition
