Z3D: Zero-Shot 3D Visual Grounding from Images

Nikita Drozdov; Andrey Lemeshko; Nikita Gavrilov; Anton Konushin; Danila Rukhovich; Maksim Kolodiazhnyi

arXiv:2602.03361 · cs.CV · February 4, 2026

Open Access

TL;DR

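Z3D is a zero-shot 3D visual grounding pipeline that works from multi-view images without geometric supervision. It pairs zero-shot 3D instance segmentation for bounding box proposals with prompt-based segmentation for VLM reasoning, and reports state-of-the-art results among zero-shot methods on ScanRefer and Nr3D.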

Contribution

The paper presents Z3D, a zero-shot 3D visual grounding pipeline that operates on multi-view images, optionally using camera poses and depth maps, and relies on modern VLMs for reasoning.

Findings

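01 Achieves state-of-the-art performance among zero-shot methods on the ScanRefer and Nr3D benchmarks.

02 Generates 3D bounding box proposals with a state-of-the-art zero-shot 3D instance segmentation method.

03 Performs reasoning via prompt-based segmentation, which draws on the capabilities of modern VLMs.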

Abstract

3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero-shot 3DVG from multi-view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero-shot methods that cause significant performance degradation and address them with (i) a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals and (ii) advanced reasoning via prompt-based segmentation, which utilizes the full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state-of-the-art performance among zero-shot methods. Code is…
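The abstract mentions 3D bounding box proposals and the optional use of camera poses and depth maps. As a generic illustration only, and not a description of Z3D's actual method, the sketch below shows one common way a 2D object mask can be lifted to an axis-aligned 3D box when depth and a camera-to-world pose are available: back-project the masked pixels with a pinhole model, move them into world coordinates, and take the per-axis extent. All function names, the `(fx, fy, cx, cy)` intrinsics tuple, and the row-major 4x4 pose layout are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(u: int, v: int, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p: Vec3, pose: Sequence[Sequence[float]]) -> Vec3:
    """Apply an assumed row-major 4x4 camera-to-world matrix to a 3D point."""
    return tuple(
        pose[r][0] * p[0] + pose[r][1] * p[1] + pose[r][2] * p[2] + pose[r][3]
        for r in range(3)
    )


def mask_to_aabb(mask: Sequence[Sequence[int]],
                 depth_map: Sequence[Sequence[float]],
                 intrinsics: Tuple[float, float, float, float],
                 pose: Sequence[Sequence[float]]) -> Tuple[Vec3, Vec3]:
    """Return (min_corner, max_corner) of the masked pixels in world coordinates."""
    fx, fy, cx, cy = intrinsics
    points: List[Vec3] = []
    for v, row in enumerate(mask):
        for u, inside in enumerate(row):
            z = depth_map[v][u]
            # Skip pixels outside the mask or without a valid depth reading.
            if inside and z > 0:
                cam_point = backproject_pixel(u, v, z, fx, fy, cx, cy)
                points.append(camera_to_world(cam_point, pose))
    if not points:
        raise ValueError("mask has no pixels with valid depth")
    mins = tuple(min(p[i] for p in points) for i in range(3))
    maxs = tuple(max(p[i] for p in points) for i in range(3))
    return mins, maxs
```

With several views, boxes obtained this way for the same object could be merged by taking the union of their extents. How Z3D forms and filters its proposals is not stated in the abstract.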

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Pose and Action Recognition