DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation
Zirui Wang, Tao Zhang

TL;DR
DenseScan introduces a richly annotated 3D scene dataset with multi-level descriptions and question generation, leveraging multi-view images and large language models to enhance 3D understanding tasks.
Contribution
The paper presents a novel automated pipeline for dense semantic annotation of 3D scenes, combining geometric and semantic information for improved visual-language applications.
Findings
Enhanced object-level understanding in 3D environments.
Improved question-answering performance over traditional datasets.
Broader applicability to downstream tasks like navigation and AR.
Abstract
3D understanding is a key capability for real-world AI assistance. High-quality data plays an important role in driving the development of the 3D understanding community. Current 3D scene understanding datasets often provide geometric and instance-level information, yet they lack the rich semantic annotations necessary for nuanced visual-language tasks. In this work, we introduce DenseScan, a novel dataset with detailed multi-level descriptions generated by an automated pipeline leveraging multi-view 2D images and multimodal large language models (MLLMs). Our approach enables dense captioning of scene elements, ensuring comprehensive object-level descriptions that capture context-sensitive details. Furthermore, we extend these annotations through scenario-based question generation, producing high-level queries that integrate object properties, spatial relationships, and scene context. By…
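
The annotation flow described in the abstract (per-view captions, merged into object-level descriptions, then turned into scenario-based questions) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the `ObjectAnnotation` schema, `annotate_object`, `make_question`, and the `stub_caption` function standing in for an MLLM call are all hypothetical names, and the merge step here is a simple de-duplication rather than whatever aggregation the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ObjectAnnotation:
    """Hypothetical multi-level record for one object instance in a scene."""
    instance_id: int
    label: str
    view_captions: List[str] = field(default_factory=list)  # one caption per 2D view
    summary: str = ""  # merged object-level description


def stub_caption(view: str, label: str) -> str:
    """Placeholder for an MLLM captioning call (assumption, not a real API)."""
    return f"A {label} seen from {view}."


def annotate_object(
    instance_id: int,
    label: str,
    views: List[str],
    caption_fn: Callable[[str, str], str],
) -> ObjectAnnotation:
    """Caption the object in every view, then merge the captions."""
    ann = ObjectAnnotation(instance_id, label)
    for view in views:
        ann.view_captions.append(caption_fn(view, label))
    # Merge step (assumed): drop duplicate captions while keeping order.
    unique = list(dict.fromkeys(ann.view_captions))
    ann.summary = " ".join(unique)
    return ann


def make_question(ann: ObjectAnnotation, scene_type: str) -> str:
    """Build a template question from the summary and scene context (assumed form)."""
    return f"In this {scene_type}, which object matches: {ann.summary}"


if __name__ == "__main__":
    chair = annotate_object(3, "chair", ["view_0", "view_1", "view_0"], stub_caption)
    print(chair.summary)
    print(make_question(chair, "bedroom"))
```

In a real setting `caption_fn` would wrap a multimodal model, and the question step would presumably also draw on spatial relationships between objects, which this sketch leaves out.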
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
