3D Concept Learning and Reasoning from Multi-View Images

Yining Hong; Chunru Lin; Yilun Du; Zhenfang Chen; Joshua B. Tenenbaum; Chuang Gan

arXiv:2303.11327 · cs.CV · March 21, 2023 · 1 citation

TL;DR

This paper introduces 3DMV-VQA, a large-scale benchmark for 3D multi-view visual question answering, shows that current state-of-the-art visual reasoning models perform poorly on it, and proposes 3D-CLR, a 3D concept learning framework that uses neural fields and vision-language models to improve reasoning performance.

Contribution

The paper presents the first large-scale 3D multi-view VQA benchmark and introduces a new 3D concept learning framework that combines neural fields with vision-language models.

Findings

01 Current state-of-the-art visual reasoning models perform poorly on the benchmark.

02 The proposed 3D-CLR framework significantly outperforms the baselines.

03 3D reasoning from multi-view images remains largely unsolved.

Abstract

Humans are able to accurately reason in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). This dataset is collected by an embodied agent actively moving and capturing RGB images in an environment using the Habitat simulator. In total, it consists of approximately 5k scenes, 600k images, paired with 50k questions. We evaluate various state-of-the-art models for visual reasoning on our benchmark and find that they all perform poorly. We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, which is further grounded on open-vocabulary semantic concepts, and then to execute reasoning on these 3D representations. As the first step towards this…
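The approach the abstract outlines (build a compact 3D representation, ground it on semantic concepts, then reason over it) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the scene is a hypothetical dictionary of voxel coordinates to small hand-written feature vectors, grounding is a cosine-similarity threshold against a hand-written concept embedding, and the only reasoning operator is counting connected groups of matching voxels. The names `ground_concept` and `count_instances` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def ground_concept(voxel_features, concept_embedding, threshold=0.8):
    """Hypothetical grounding step: keep voxels whose feature is close to the concept."""
    return {
        coord
        for coord, feat in voxel_features.items()
        if cosine(feat, concept_embedding) >= threshold
    }

def count_instances(voxels):
    """Hypothetical reasoning operator: count 6-connected groups of voxels."""
    remaining = set(voxels)
    neighbours = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    count = 0
    while remaining:
        stack = [remaining.pop()]  # start a new connected group
        count += 1
        while stack:
            x, y, z = stack.pop()
            for dx, dy, dz in neighbours:
                nxt = (x + dx, y + dy, z + dz)
                if nxt in remaining:
                    remaining.remove(nxt)
                    stack.append(nxt)
    return count

# Toy stand-ins for learned embeddings (assumed, 2-dimensional for readability).
CHAIR = [1.0, 0.0]
TABLE = [0.0, 1.0]

# Toy 3D scene: voxel coordinate -> feature vector.
scene = {
    (0, 0, 0): [0.9, 0.1],    # chair-like, touches the next voxel
    (1, 0, 0): [0.95, 0.05],  # chair-like
    (5, 0, 0): [1.0, 0.0],    # chair-like, far from the first two
    (3, 3, 0): [0.1, 0.9],    # table-like
}

# "How many chairs are there?" as ground-then-count.
num_chairs = count_instances(ground_concept(scene, CHAIR))
num_tables = count_instances(ground_concept(scene, TABLE))
print(num_chairs, num_tables)
```

In this toy scene the two adjacent chair-like voxels form one group and the distant one forms another. The point of the sketch is the separation of stages: the question never touches the images directly, only the grounded 3D representation.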

Code & Models

Datasets

ShuhongZheng/3D-CLR
dataset · 197 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning