Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding
Yufei Yin, Jie Zheng, Qianke Meng, Zhou Yu, Minghao Chen, Jiajun Ding, Min Tan, Yuling Xi, Zhiwen Chen, and Chengfei Lv

TL;DR
This paper introduces MCM-VG, a novel framework for zero-shot 3D visual grounding that improves accuracy by establishing multiple consistent 2D-3D mappings and leveraging large language and vision models.
Contribution
MCM-VG explicitly enforces 2D-3D consistency across three dimensions to enhance zero-shot 3D visual grounding performance.
Findings
Achieves 62.0% accuracy at an IoU threshold of 0.25 on ScanRefer, surpassing previous methods.
Sets new state-of-the-art results on ScanRefer and Nr3D benchmarks.
Effectively reconstructs missing targets and reduces spatial redundancy in 3D grounding.
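For readers unfamiliar with the metric in the findings above, accuracy at an IoU threshold counts a prediction as correct when its 3D box overlaps the ground-truth box by at least that ratio. The following is a minimal sketch, assuming axis-aligned boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax). The function names `iou_3d` and `accuracy_at_iou` are hypothetical and are not taken from the paper or from any benchmark toolkit.

```python
from typing import Sequence

# Assumed box layout (illustrative): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            # No overlap along this axis means no overlap at all.
            return 0.0
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                    threshold: float = 0.25) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold`."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under this definition, a reported 62.0% at 0.25 IoU would mean that 62.0% of the referring expressions were grounded to a box overlapping the annotated target by at least one quarter of their union. Benchmark evaluation code may differ in details such as box parameterization.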
Abstract
Zero-shot 3D Visual Grounding (3DVG) is a critical capability for open-world embodied AI. However, existing methods are fundamentally bottlenecked by the poor quality of open-vocabulary 3D proposals, which suffer from inaccurate categories and imprecise geometries, and by the spatial redundancy of exhaustive multi-view reasoning. To address these challenges, we propose MCM-VG, a novel framework that achieves robust zero-shot 3DVG by explicitly establishing Multiple Consistent 2D-3D Mappings. Instead of passively relying on noisy 3D segments, MCM-VG enforces 2D-3D consistency across three fundamental dimensions to achieve precise target localization and reliable reasoning. First, a Semantic Alignment module corrects category mismatches via LLM-driven query parsing and coarse-to-fine 2D-3D matching. Second, an Instance Rectification module leverages VLM-guided 2D segmentations to…
