DC-Scene: Data-Centric Learning for 3D Scene Understanding
Ting Huang, Zeyu Zhang, Ruicheng Zhang, Yang Zhao

TL;DR
DC-Scene introduces a data-centric approach to 3D scene understanding that improves training efficiency and performance by selecting high-quality training samples, which reduces reliance on large datasets. Results are demonstrated on ScanRefer and Nr3D.
Contribution
The paper proposes a novel CLIP-driven dual-indicator quality filter and curriculum scheduler to enhance data quality and training efficiency in 3D scene understanding.
Findings
Achieves a state-of-the-art CIDEr score of 86.1 using only the top-75% subset of the data.
Reduces training cost by approximately two-thirds.
Training on the filtered high-quality subset outperforms training on the full dataset.
Abstract
3D scene understanding plays a fundamental role in vision applications such as robotics, autonomous driving, and augmented reality. However, advancing learning-based 3D scene understanding remains challenging due to two key limitations: (1) the large scale and complexity of 3D scenes lead to higher computational costs and slower training compared to 2D counterparts; and (2) high-quality annotated 3D datasets are significantly scarcer than those available for 2D vision. These challenges underscore the need for more efficient learning paradigms. In this work, we propose DC-Scene, a data-centric framework tailored for 3D scene understanding, which emphasizes enhancing data quality and training efficiency. Specifically, we introduce a CLIP-driven dual-indicator quality (DIQ) filter, combining vision-language alignment scores with caption-loss perplexity, along with a curriculum scheduler…
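The abstract describes the DIQ filter only as a combination of vision-language alignment scores and caption-loss perplexity. The sketch below shows one way such a two-indicator filter could be arranged. It is not the authors' implementation. The function names, the min-max normalization, the equal weighting, and the choice to treat lower perplexity as higher quality are all assumptions made for illustration, and the alignment scores and caption losses are taken as precomputed inputs rather than produced by CLIP or a captioning model here.

```python
import math
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant input maps to all 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def diq_scores(
    alignment: Sequence[float],
    caption_loss: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Combine two quality indicators into one score per sample.

    alignment:    precomputed vision-language alignment score per sample
                  (higher is assumed better).
    caption_loss: mean per-token caption loss per sample; perplexity is
                  exp(loss), and lower is ASSUMED to mean higher quality.
    weight:       hypothetical mixing weight between the two indicators.
    """
    perplexity = [math.exp(loss) for loss in caption_loss]
    align_n = min_max(alignment)
    ppl_n = min_max(perplexity)
    # Invert normalized perplexity so that both terms reward quality.
    return [weight * a + (1.0 - weight) * (1.0 - p)
            for a, p in zip(align_n, ppl_n)]


def keep_top_fraction(scores: Sequence[float], fraction: float = 0.75) -> List[int]:
    """Return the indices of the highest-scoring `fraction` of samples."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy values, not taken from the paper.
    alignment = [0.9, 0.2, 0.6, 0.4]
    caption_loss = [1.0, 3.0, 1.5, 2.0]
    scores = diq_scores(alignment, caption_loss)
    print(keep_top_fraction(scores, fraction=0.75))  # prints [0, 2, 3]
```

With the toy inputs, sample 1 has both the weakest alignment and the highest perplexity, so it is the one dropped when keeping the top 75%. The 0.75 default mirrors the top-75% subset reported in the findings. Every other setting here is a placeholder.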
Taxonomy
Topics: Image Processing and 3D Reconstruction · 3D Surveying and Cultural Heritage · Advanced Vision and Imaging
Methods: Sparse Evolutionary Training
