Multi-modal Situated Reasoning in 3D Scenes
Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, Siyuan Huang

TL;DR
This paper introduces MSQA, a large-scale multi-modal dataset for situated reasoning in 3D scenes, and benchmarks multi-modal reasoning and navigation, revealing current model limitations and the potential of data scaling.
Contribution
It presents MSQA, a comprehensive multi-modal dataset for 3D scene reasoning, and introduces new benchmarks for multi-modal reasoning and navigation tasks.
Findings
Existing models struggle with multi-modal interleaved inputs.
Data scaling improves reasoning performance.
Pre-training on MSQA enhances model capabilities.
Abstract
Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding are limited in data modality, diversity, scale, and task scope. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide text, image, and point cloud for situation and question description, resolving ambiguity in previous single-modality convention (e.g., text). Additionally, we devise…
Taxonomy
Topics: Constraint Satisfaction and Optimization · Multimodal Machine Learning Applications · Semantic Web and Ontologies
