Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset

Wentao Mo; Qingchao Chen; Yuxin Peng; Siyuan Huang; Yang Liu

arXiv:2508.11058 · cs.CV · August 18, 2025

TL;DR

This paper introduces MV-ScanQA, a challenging 3D question answering dataset emphasizing multi-view reasoning, and TripAlign, a large-scale pre-training corpus that enhances multi-object understanding in 3D scene analysis.

Contribution

It presents a new multi-view reasoning dataset and a large-scale pre-training corpus, enabling models to better understand complex 3D scenes through multi-view and multi-object reasoning.

Findings

01. LEGO achieves state-of-the-art results on MV-ScanQA

02. TripAlign improves multi-view 3D reasoning performance

03. Models trained with TripAlign generalize well to existing benchmarks

Abstract

The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in a single viewpoint, and their annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset in which 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present the TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training…
