Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Junha Lee; Chunghyun Park; Jaesung Choe; Yu-Chiang Frank Wang; Jan Kautz; Minsu Cho; Chris Choy

arXiv:2502.02548·cs.CV·April 16, 2025

Open Access

TL;DR

Mosaic3D introduces a large-scale dataset of 3D mask-text pairs and a foundation model for open-vocabulary 3D segmentation. The dataset comes from an automatic generation pipeline, and the model's 3D encoder is trained with contrastive learning.

Contribution

The paper presents Mosaic3D-5.6M, a dataset of 5.6 million mask-text pairs across more than 30K annotated scenes, and Mosaic3D, a foundation model for open-vocabulary 3D semantic and instance segmentation.

Findings

01. Achieves state-of-the-art results on multiple 3D segmentation benchmarks.

02. Demonstrates the effectiveness of large-scale data and contrastive training.

03. Validates the approach through comprehensive ablation studies.

Abstract

We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our…
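The abstract mentions a 3D encoder trained with contrastive learning on mask-text pairs but does not spell out the objective. The sketch below is one common way such training could look, not the authors' implementation: point features are mean-pooled inside each 3D mask, then a region-to-text InfoNCE loss pulls each region toward its own caption embedding. The function names (`mean_pool`, `contrastive_loss`), the pooling choice, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_pool(point_feats, mask):
    """Average the features of the points selected by a boolean mask.

    Hypothetical helper: turns per-point encoder outputs into one
    embedding per 3D region.
    """
    selected = [f for f, m in zip(point_feats, mask) if m]
    if not selected:
        raise ValueError("mask selects no points")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Region-to-text InfoNCE loss; entry i of each list is a matched pair.

    The temperature value is an assumed default, not taken from the paper.
    """
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for i, r in enumerate(regions):
        # Cosine similarity of region i to every caption, scaled by temperature.
        logits = [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp over all captions.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy with the matching caption as the target.
        total += log_denom - logits[i]
    return total / len(regions)


# Toy usage: two regions, each pooled from two points, with matching captions.
point_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
masks = [[True, True, False, False], [False, False, True, True]]
region_embs = [mean_pool(point_feats, m) for m in masks]
text_embs = [[1.0, 0.0], [0.0, 1.0]]

aligned = contrastive_loss(region_embs, text_embs)
shuffled = contrastive_loss(region_embs, text_embs[::-1])
print(aligned < shuffled)
```

In this toy setup the loss is lower when captions are paired with their own regions than when the captions are swapped, which is the behavior a contrastive objective is meant to reward. At inference, the same cosine similarity between point or region embeddings and arbitrary text embeddings is what would allow open-vocabulary labeling.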

Taxonomy

Topics: Natural Language Processing Techniques · Image Processing and 3D Reconstruction · Handwritten Text Recognition Techniques

Methods: Contrastive Learning