Towards Foundation Models for 3D Vision: How Close Are We?
Yiming Zuo, Karhan Kayan, Maggie Wang, Kevin Jeon, Jia Deng, Thomas L. Griffiths

TL;DR
This paper evaluates current 3D vision models and humans using a new benchmark, revealing gaps in model robustness and alignment with human vision, and highlights the potential of Transformer-based architectures.
Contribution
It introduces UniQA-3D, a comprehensive benchmark for 3D visual understanding, and provides insights into the capabilities and limitations of current models compared to humans.
Findings
VLMs perform poorly on 3D tasks
Specialized models lack robustness under geometric perturbations
Transformers like ViT align more closely with human 3D vision mechanisms
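
The robustness finding can be illustrated with a minimal sketch. This is not the paper's evaluation code: it assumes a two-choice relative-depth question ("which marked point is closer, A or B?"), uses a vertical flip as one example of a geometric perturbation, and substitutes a toy heuristic (`lower_is_closer`) for a real model. All names and data below are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical sample layout: (depth_map, point_a, point_b, answer).
# Points are (row, col); answer is "A" or "B" for the point closer to the camera.
Grid = List[List[float]]
Point = Tuple[int, int]
Sample = Tuple[Grid, Point, Point, str]
Model = Callable[[Grid, Point, Point], str]


def flip_vertical(sample: Sample) -> Sample:
    """Turn the scene upside down; the correct answer does not change."""
    grid, a, b, answer = sample
    last = len(grid) - 1
    return (grid[::-1], (last - a[0], a[1]), (last - b[0], b[1]), answer)


def lower_is_closer(grid: Grid, a: Point, b: Point) -> str:
    """Toy stand-in model: ignores the depth map and guesses that the point
    lower in the image is closer (a shortcut that holds for upright scenes)."""
    return "A" if a[0] > b[0] else "B"


def accuracy(model: Model, samples: List[Sample]) -> float:
    """Fraction of samples where the model's choice matches the answer."""
    correct = sum(model(g, a, b) == ans for g, a, b, ans in samples)
    return correct / len(samples)


def robustness_gap(model: Model, samples: List[Sample],
                   perturb: Callable[[Sample], Sample]) -> Tuple[float, float]:
    """Return (clean accuracy, accuracy after applying the perturbation)."""
    clean = accuracy(model, samples)
    perturbed = accuracy(model, [perturb(s) for s in samples])
    return clean, perturbed


# Synthetic 4x4 scene: depth shrinks toward the bottom rows (ground plane).
GRID: Grid = [[float(4 - r)] * 4 for r in range(4)]
SAMPLES: List[Sample] = [
    (GRID, (3, 0), (0, 2), "A"),  # A is on the bottom row, so A is closer
    (GRID, (1, 1), (2, 3), "B"),  # B is lower in the image, so B is closer
]

clean_acc, flipped_acc = robustness_gap(lower_is_closer, SAMPLES, flip_vertical)
print(f"clean={clean_acc:.2f} flipped={flipped_acc:.2f}")
```

In this toy setup the heuristic is perfect on upright scenes and fails once they are flipped, which is the kind of gap between accuracy and robustness that the finding describes.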
Abstract
Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark named UniQA-3D. UniQA-3D covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms compared to classical computer vision methods, and…
Taxonomy
Topics: 3D Surveying and Cultural Heritage
Methods: ALIGN
