Do large language vision models understand 3D shapes?
Sagi Eppel

TL;DR
This paper evaluates whether large vision language models can understand 3D shapes by testing their ability to recognize and match objects across different orientations and materials, revealing partial but limited understanding.
Contribution
The study introduces a novel benchmark for assessing 3D shape understanding in LVLMs and provides empirical results comparing model performance to human perception.
Findings
Models outperform random guesses but lag behind humans in 3D shape recognition.
Models easily match objects when only the orientation or only the material changes.
Performance drops significantly when both orientation and material change simultaneously.
Abstract
Large vision language models (LVLMs) are the leading AI approach for achieving a general visual understanding of the world. Models such as GPT, Claude, Gemini, and Llama can use images to understand and analyze complex visual scenes. 3D objects and shapes are the basic building blocks of the world, and recognizing them is a fundamental part of human perception. The goal of this work is to test whether LVLMs truly understand 3D shapes by testing the models' ability to identify and match objects of the exact same 3D shape but with different orientations and materials/textures. A large number of test images were created using CGI with a huge number of highly diverse objects, materials, and scenes. The results of this test show that the ability of such models to match 3D shapes is significantly below that of humans but much higher than random guessing, suggesting that the models have gained some…
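The comparison against random guessing described in the abstract can be illustrated with a small scoring sketch. This is not the paper's evaluation code. The `Trial` fields, the condition labels, and the assumption that each question is a multiple-choice match among `num_choices` candidate images are all hypothetical, and the toy trials are made-up placeholders, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    condition: str      # hypothetical label: "orientation", "material", or "both"
    num_choices: int    # number of candidate images shown (assumed format)
    correct_index: int  # index of the candidate with the same 3D shape
    model_answer: int   # index the model picked


def accuracy_by_condition(trials):
    """Return {condition: (accuracy, chance_level)} for a list of trials."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    chance = defaultdict(float)
    for t in trials:
        totals[t.condition] += 1
        hits[t.condition] += int(t.model_answer == t.correct_index)
        # A uniform random guesser is right with probability 1 / num_choices.
        chance[t.condition] += 1.0 / t.num_choices
    return {c: (hits[c] / totals[c], chance[c] / totals[c]) for c in totals}


# Toy, made-up trials purely to show the call; not data from the paper.
toy_trials = [
    Trial("orientation", 4, 0, 0),
    Trial("orientation", 4, 1, 1),
    Trial("both", 4, 2, 2),
    Trial("both", 4, 3, 0),
]
for condition, (acc, chance_level) in accuracy_by_condition(toy_trials).items():
    print(f"{condition}: accuracy={acc:.2f}, chance={chance_level:.2f}")
```

Reporting the chance level next to each accuracy makes it easy to see whether a condition, such as changing orientation and material together, pushes a model back toward random guessing.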
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Cosine Annealing · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Weight Decay
