Towards Foundation Models for 3D Vision: How Close Are We?

Yiming Zuo; Karhan Kayan; Maggie Wang; Kevin Jeon; Jia Deng; Thomas L. Griffiths

arXiv:2410.10799·cs.CV·December 10, 2024



TL;DR

This paper evaluates current 3D vision models and humans using a new benchmark, revealing gaps in model robustness and alignment with human vision, and highlights the potential of Transformer-based architectures.

Contribution

It introduces UniQA-3D, a comprehensive benchmark for 3D visual understanding, and provides insights into the capabilities and limitations of current models compared to humans.

Findings

01. VLMs perform poorly on 3D tasks
02. Specialized models lack robustness under geometric perturbations
03. Transformers like ViT align more closely with human 3D vision mechanisms

Abstract

Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark named UniQA-3D. UniQA-3D covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms compared to classical computer vision methods, and…
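The abstract's distinction between "accurate" and "robust" can be made concrete with a small sketch. The code below is not UniQA-3D's evaluation code or data format. It is a toy, assuming multiple-choice VQA items of the form (image, question, answer) and using an upside-down flip as one example of a geometric perturbation. All names (`flip_vertical`, `accuracy`, `robustness_gap`, `lower_is_closer`, `ITEMS`) are hypothetical. The toy model relies on a "lower in the image means closer" shortcut, so it is perfect on upright images and fails once they are flipped.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Image = List[List[int]]        # toy image: 0 = background, 1 = marker A, 2 = marker B
Item = Tuple[Image, str, str]  # (image, question, ground-truth answer)
Model = Callable[[Image, str], str]
Perturbation = Callable[[Image], Image]


def flip_vertical(image: Image) -> Image:
    """Turn the image upside down by reversing the order of its rows."""
    return [row[:] for row in reversed(image)]


def accuracy(model: Model, items: Sequence[Item],
             perturb: Optional[Perturbation] = None) -> float:
    """Fraction of items answered correctly, optionally after perturbing each image."""
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for image, question, answer in items:
        if perturb is not None:
            image = perturb(image)
        if model(image, question) == answer:
            correct += 1
    return correct / len(items)


def robustness_gap(model: Model, items: Sequence[Item],
                   perturb: Perturbation) -> float:
    """Accuracy on the original images minus accuracy on the perturbed images."""
    return accuracy(model, items) - accuracy(model, items, perturb)


def lower_is_closer(image: Image, question: str) -> str:
    """Hypothetical brittle model: guesses that the marker lower in the frame is closer."""
    row_a = next(r for r, row in enumerate(image) if 1 in row)
    row_b = next(r for r, row in enumerate(image) if 2 in row)
    return "A" if row_a > row_b else "B"


# Toy items in which the "lower is closer" shortcut happens to be correct.
# The ground-truth answer does not change when the image is flipped,
# because the markers move together with the scene.
QUESTION = "Which marked point is closer to the camera, A or B?"
ITEMS: List[Item] = [
    ([[0, 2], [0, 0], [1, 0]], QUESTION, "A"),
    ([[1, 0], [0, 0], [0, 2]], QUESTION, "B"),
]

if __name__ == "__main__":
    print("upright accuracy:", accuracy(lower_is_closer, ITEMS))
    print("flipped accuracy:", accuracy(lower_is_closer, ITEMS, flip_vertical))
    print("robustness gap:  ", robustness_gap(lower_is_closer, ITEMS, flip_vertical))
```

Reporting the gap alongside upright accuracy separates the two properties: a model can score highly on the first number while the second reveals that it depends on an image-layout shortcut.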


Taxonomy

Topics: 3D Surveying and Cultural Heritage

Methods: ALIGN