3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering
Rongtao Xu, Han Gao, Mingming Yu, Dong An, Shunpeng Chen, Changwei Wang, Li Guo, Xiaodan Liang, Shibiao Xu

TL;DR
3D-MoRe introduces a unified framework that leverages foundational models to generate large-scale 3D-language datasets, significantly improving reasoning and response quality in embodied question answering tasks within complex indoor scenes.
Contribution
The paper presents 3D-MoRe, a novel paradigm that combines multi-modal embedding, cross-modal interaction, and language modeling to generate extensive 3D-language datasets for indoor scene understanding.
Findings
Generated 62,000 QA pairs and 73,000 object descriptions across 1,513 scenes.
Achieved a 2.15% improvement in CIDEr score on ScanQA.
Improved CIDEr@0.5 by 1.84% on ScanRefer.
Abstract
With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate…
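The abstract mentions semantic filtering of the generated data but does not say how it is done. The snippet below is only a minimal sketch of what such a filter might look like, not the authors' method. It assumes a simple token-overlap (Jaccard) similarity as a stand-in for whatever semantic measure the paper uses. The function names (`tokens`, `jaccard`, `filter_generated`) and the thresholds `min_sim` and `max_dup` are hypothetical, chosen only for illustration.

```python
import re


def tokens(text):
    """Return the set of lowercased alphanumeric word tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_generated(generated, references, min_sim=0.2, max_dup=0.9):
    """Keep generated texts that overlap with at least one reference
    annotation and are not near-duplicates of an already kept text.

    Assumption: token overlap stands in for a real semantic similarity
    model; the thresholds are illustrative, not taken from the paper.
    """
    kept = []
    for cand in generated:
        on_topic = any(jaccard(cand, ref) >= min_sim for ref in references)
        duplicate = any(jaccard(cand, prev) >= max_dup for prev in kept)
        if on_topic and not duplicate:
            kept.append(cand)
    return kept


references = ["a brown wooden chair next to the desk"]
generated = [
    "the brown chair is next to the desk",
    "the brown chair is next to the desk",  # exact duplicate
    "stock prices rose sharply today",      # unrelated to the scene
]
result = filter_generated(generated, references)
```

Here `result` keeps only the first candidate: the second is dropped as a duplicate and the third shares no tokens with the reference. In practice a sentence-embedding model would likely replace `jaccard`, with the same keep/drop logic around it.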
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
