Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding
Ruihuang Li, Zhengqiang Zhang, Chenhang He, Zhiyuan Ma, Vishal M. Patel, Lei Zhang

TL;DR
This paper introduces a Dense Multimodal Alignment framework that densely co-embeds multiple modalities into a common space, significantly improving open-vocabulary 3D scene understanding and segmentation performance.
Contribution
The DMA framework combines the 3D point, image, and text modalities, using large vision-language models to build the text modality and dense point-pixel-text associations to align all three, which enhances zero-shot 3D scene understanding.
Findings
Achieves state-of-the-art open-vocabulary segmentation results
Effective dense multimodal co-embedding for 3D understanding
Improves generalization of 2D models to 3D tasks
Abstract
Recent vision-language pre-training models have exhibited remarkable generalization ability in zero-shot recognition tasks. Previous open-vocabulary 3D scene understanding methods mostly focus on training 3D models using either image or text supervision while neglecting the collective strength of all modalities. In this work, we propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space for maximizing their synergistic benefits. Instead of extracting coarse view- or region-level text prompts, we leverage large vision-language models to extract complete category information and scalable scene descriptions to build the text modality, and take image modality as the bridge to build dense point-pixel-text associations. Besides, in order to enhance the generalization ability of the 2D model for downstream 3D tasks without compromising the…
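To make the idea of a common embedding space concrete, the sketch below shows how open-vocabulary labeling is commonly done once per-point features and category text embeddings live in the same space: each point takes the category whose text embedding has the highest cosine similarity. This is a generic illustration and not the paper's implementation. The function names (`normalize`, `cosine_label`) and the toy 3-dimensional vectors are assumptions made for the example; in practice the embeddings would come from a trained 3D backbone and a vision-language text encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_label(point_features, text_embeddings):
    """Assign each point the category with the most similar text embedding.

    point_features:  list of per-point feature vectors (hypothetical toy data).
    text_embeddings: dict mapping a category name to its text embedding.
    """
    unit_text = {name: normalize(vec) for name, vec in text_embeddings.items()}
    predictions = []
    for feature in point_features:
        unit_feature = normalize(feature)
        # Cosine similarity is the dot product of unit-length vectors.
        scores = {
            name: sum(a * b for a, b in zip(unit_feature, vec))
            for name, vec in unit_text.items()
        }
        predictions.append(max(scores, key=scores.get))
    return predictions


# Toy embeddings standing in for real encoder outputs (assumed values).
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "wall": [0.0, 0.0, 1.0],
}
point_features = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.1, 2.0]]

predictions = cosine_label(point_features, text_embeddings)
print(predictions)  # ['chair', 'table', 'wall']
```

Because the category list is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is what makes this style of labeling open-vocabulary.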
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training · Focus · Dual Multimodal Attention
