3D Concept Learning and Reasoning from Multi-View Images
Yining Hong, Chunru Lin, Yilun Du, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan

TL;DR
This paper introduces a large-scale 3D multi-view VQA benchmark, evaluates current models' limitations, and proposes a novel 3D concept learning framework that improves reasoning performance using neural fields and vision-language models.
Contribution
It presents the first large-scale 3D multi-view VQA benchmark and introduces a new 3D concept learning framework combining neural fields and vision-language models.
Findings
Current models perform poorly on the benchmark.
The proposed 3D-CLR framework significantly outperforms the baselines.
Challenges in 3D reasoning remain largely unsolved.
Abstract
Humans are able to accurately reason in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). This dataset is collected by an embodied agent actively moving and capturing RGB images in an environment using the Habitat simulator. In total, it consists of approximately 5k scenes, 600k images, paired with 50k questions. We evaluate various state-of-the-art models for visual reasoning on our benchmark and find that they all perform poorly. We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, which is further grounded on open-vocabulary semantic concepts, and then to execute reasoning on these 3D representations. As the first step towards this…
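The three-stage recipe described in the abstract (build a compact 3D representation, ground it in semantic concepts, then reason over it) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the scene is a hand-written dictionary of voxel coordinates to 2-D feature vectors rather than a learned neural field, the concept embeddings CHAIR and TABLE stand in for vision-language features, and the helper names ground_concept and count_instances, the similarity threshold, and the use of connected components for counting are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def ground_concept(voxel_features, concept_embedding, threshold=0.8):
    """Select voxels whose feature is close to the concept embedding.

    The threshold is an arbitrary illustrative value.
    """
    return {
        xyz
        for xyz, feat in voxel_features.items()
        if cosine(feat, concept_embedding) >= threshold
    }


def count_instances(voxels):
    """Count 6-connected components among the selected voxels.

    Hypothetical counting operator: each connected blob is one instance.
    """
    remaining = set(voxels)
    neighbours = ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                  (0, -1, 0), (0, 0, 1), (0, 0, -1))
    count = 0
    while remaining:
        stack = [remaining.pop()]
        count += 1
        while stack:
            x, y, z = stack.pop()
            for dx, dy, dz in neighbours:
                n = (x + dx, y + dy, z + dz)
                if n in remaining:
                    remaining.remove(n)
                    stack.append(n)
    return count


# Toy stand-ins for concept embeddings (assumed, not real model outputs).
CHAIR = [1.0, 0.0]
TABLE = [0.0, 1.0]

# Toy "3D representation": voxel coordinate -> feature vector.
scene = {
    (0, 0, 0): [0.9, 0.1],    # chair-like, adjacent to the next voxel
    (1, 0, 0): [0.95, 0.05],  # chair-like
    (5, 0, 0): [1.0, 0.0],    # chair-like, far from the first two
    (3, 3, 0): [0.1, 0.9],    # table-like
}

# "How many chairs are there?" as a two-step program: filter, then count.
chairs = ground_concept(scene, CHAIR)
num_chairs = count_instances(chairs)
print(num_chairs)
```

The point of the sketch is only the separation of concerns: grounding turns a concept name into a 3D region, and reasoning operators such as counting then work on that region rather than on individual 2D views.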
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
