MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

Ruiyuan Lyu; Jingli Lin; Tai Wang; Shuai Yang; Xiaohan Mao; Yilun Chen; Runsen Xu; Haifeng Huang; Chenming Zhu; Dahua Lin; Jiangmiao Pang

arXiv:2406.09401 · cs.CV · June 10, 2025

TL;DR

This paper introduces MMScan, the largest multi-modal 3D scene dataset with hierarchical grounded language annotations, enabling advanced 3D perception, understanding, and benchmarking for vision-language tasks.

Contribution

The paper builds the first large-scale multi-modal 3D scene dataset with hierarchical grounded language annotations. VLMs initialize the annotations and human correction refines them, covering both spatial and attribute understanding.

Findings

01

High-quality dataset with 1.4M captions and 3.04M samples

02

Improved performance of state-of-the-art models on benchmarks

03

Analysis of model capabilities and future challenges

Abstract

With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception has attracted growing attention for its connection to the physical world and has made rapid progress. However, limited by existing datasets, previous work mainly focuses on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds the largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. It is constructed with a top-down logic, from region to object level and from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline uses powerful VLMs with carefully designed prompts to initialize the annotations efficiently, and further involves human correction in the loop to ensure the…

Code & Models

Repositories

openrobotlab/embodiedscan
PyTorch · Official

Videos

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations · SlidesLive

Taxonomy

Topics: Image Processing and 3D Reconstruction · Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage

Methods: Focus