Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset
Wentao Mo, Qingchao Chen, Yuxin Peng, Siyuan Huang, Yang Liu

TL;DR
This paper introduces MV-ScanQA, a challenging 3D question answering dataset emphasizing multi-view reasoning, and TripAlign, a large-scale pre-training corpus that enhances multi-object understanding in 3D scene analysis.
Contribution
It presents a new multi-view reasoning dataset and a large-scale pre-training corpus, enabling models to better understand complex 3D scenes through multi-view and multi-object reasoning.
Findings
LEGO achieves state-of-the-art results on MV-ScanQA
TripAlign improves multi-view 3D reasoning performance
Models trained with TripAlign generalize well to existing benchmarks
Abstract
The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in a single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present the TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training…
