Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding

Yufei Yin; Jie Zheng; Qianke Meng; Zhou Yu; Minghao Chen; Jiajun Ding; Min Tan; Yuling Xi; Zhiwen Chen; and Chengfei Lv

arXiv:2604.26261 · cs.CV · April 30, 2026

TL;DR

This paper introduces MCM-VG, a framework for zero-shot 3D visual grounding that improves accuracy by establishing multiple consistent 2D-3D mappings, using a large language model (LLM) for query parsing and a vision-language model (VLM) to guide 2D segmentation.

Contribution

MCM-VG explicitly enforces 2D-3D consistency across three dimensions to enhance zero-shot 3D visual grounding performance.

Findings

01. Achieves 62.0% accuracy at an IoU threshold of 0.25 on ScanRefer, surpassing previous methods.

02. Sets new state-of-the-art results on the ScanRefer and Nr3D benchmarks.

03. Reconstructs missing targets and reduces spatial redundancy in 3D grounding.
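To make the "accuracy at 0.25 IoU" figure concrete, here is a minimal sketch of how such a metric is commonly computed. It assumes axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and one predicted box per query; the function names `iou_3d` and `accuracy_at_iou` are illustrative and are not taken from the paper or from any benchmark toolkit.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])          # start of the overlap on this axis
        hi = min(a[axis + 3], b[axis + 3])  # end of the overlap on this axis
        if hi <= lo:
            return 0.0                      # no overlap on this axis
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0.0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.25
) -> float:
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under this definition, a prediction counts as correct when its box overlaps the ground-truth box with IoU of at least 0.25, so the reported 62.0% would be the share of queries meeting that bar. A stricter threshold such as 0.5 can be evaluated by changing `threshold`.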

Abstract

Zero-shot 3D Visual Grounding (3DVG) is a critical capability for open-world embodied AI. However, existing methods are fundamentally bottlenecked by the poor quality of open-vocabulary 3D proposals, suffering from inaccurate categories and imprecise geometries, as well as the spatial redundancy of exhaustive multi-view reasoning. To address these challenges, we propose MCM-VG, a novel framework that achieves robust zero-shot 3DVG by explicitly establishing Multiple Consistent 2D-3D Mappings. Instead of passively relying on noisy 3D segments, MCM-VG enforces 2D-3D consistency across three fundamental dimensions to achieve precise target localization and reliable reasoning. First, a Semantic Alignment module corrects category mismatches via LLM-driven query parsing and coarse-to-fine 2D-3D matching. Second, an Instance Rectification module leverages VLM-guided 2D segmentations to…
