Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

Ruihuang Li; Zhengqiang Zhang; Chenhang He; Zhiyuan Ma; Vishal M. Patel; Lei Zhang

arXiv:2407.09781·cs.CV·July 16, 2024·1 cites

TL;DR

This paper introduces a Dense Multimodal Alignment framework that densely co-embeds multiple modalities into a common space, significantly improving open-vocabulary 3D scene understanding and segmentation performance.

Contribution

The DMA framework combines the 3D point, image, and text modalities, using large vision-language models to build the text modality and images as the bridge for dense point-pixel-text associations, to improve zero-shot 3D scene understanding.

Findings

01. Achieves state-of-the-art open-vocabulary segmentation results

02. Effective dense multimodal co-embedding for 3D understanding

03. Improves generalization of 2D models to 3D tasks

Abstract

Recent vision-language pre-training models have exhibited remarkable generalization ability in zero-shot recognition tasks. Previous open-vocabulary 3D scene understanding methods mostly focus on training 3D models using either image or text supervision while neglecting the collective strength of all modalities. In this work, we propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space for maximizing their synergistic benefits. Instead of extracting coarse view- or region-level text prompts, we leverage large vision-language models to extract complete category information and scalable scene descriptions to build the text modality, and take image modality as the bridge to build dense point-pixel-text associations. Besides, in order to enhance the generalization ability of the 2D model for downstream 3D tasks without compromising the…
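To make the idea of dense point-pixel-text association concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract above is truncated and does not specify the projection model or the loss. The sketch assumes a pinhole camera with intrinsics `K`, a per-pixel map of text ids (`pixel_text_ids`, hypothetical), and a softmax cross-entropy over cosine similarities as a stand-in alignment objective. All function names are illustrative.

```python
import numpy as np


def point_pixel_text_pairs(points_cam, K, pixel_text_ids):
    """For each 3D point (camera coordinates), return the text id of the
    pixel it projects to, or -1 if it falls behind the camera or off-image.

    Assumption: a pinhole camera model with 3x3 intrinsics K.
    """
    h, w = pixel_text_ids.shape
    ids = np.full(len(points_cam), -1, dtype=int)
    valid = points_cam[:, 2] > 0                # keep points in front of the camera
    uvw = points_cam[valid] @ K.T               # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(valid)[inside]
    ids[idx] = pixel_text_ids[v[inside], u[inside]]
    return ids


def cosine_logits(point_feats, text_feats):
    """Cosine similarity between every point feature and every text embedding."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return p @ t.T


def alignment_loss(point_feats, text_feats, text_ids, temperature=0.07):
    """Cross-entropy pulling each point toward its associated text embedding.

    Assumption: this loss is a generic stand-in, not the paper's objective.
    Points with text id -1 (no association) are ignored.
    """
    keep = text_ids >= 0
    logits = cosine_logits(point_feats[keep], text_feats) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(keep.sum()), text_ids[keep]].mean()
```

At inference time, an open-vocabulary prediction under these assumptions would be `cosine_logits(point_feats, text_feats).argmax(axis=1)`, where `text_feats` holds the embeddings of whatever category names are queried. The loss assumes at least one point has a valid association.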

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training · Focus · Dual Multimodal Attention