Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks
Simranjit Singh, Georgios Pavlakos, Dimitrios Stamoulis

TL;DR
This paper assesses the zero-shot performance of GPT-4V and GPT-4 on the 3D-VQA and ScanQA benchmarks. Without any fine-tuning, the GPT-based agents perform on par with traditional closed-vocabulary approaches, and they benefit significantly from scene-specific vocabulary supplied in context.
Contribution
It provides the first evaluation of GPT-4V on 3D VQA benchmarks and highlights the importance of scene-specific vocabulary in zero-shot settings.
Findings
GPT-4V matches traditional closed-vocabulary approaches without fine-tuning
Scene-specific vocabulary improves GPT-4V performance
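As a rough illustration of what supplying scene-specific vocabulary in context could look like, the sketch below prepends a list of object names to a question before it is sent to a model. The prompt wording, the function name `build_grounded_prompt`, and the example object list are all assumptions made for illustration; this summary does not show the prompts the authors actually used.

```python
from typing import List


def build_grounded_prompt(question: str, scene_vocabulary: List[str]) -> str:
    """Prepend scene-specific object names to a question.

    Hypothetical template, not the prompt used in the paper.
    """
    # Normalize terms and drop duplicates while keeping first-seen order.
    cleaned = (term.strip().lower() for term in scene_vocabulary)
    unique_terms = list(dict.fromkeys(term for term in cleaned if term))
    vocab_line = ", ".join(unique_terms)
    return (
        f"Objects present in this scene: {vocab_line}.\n"
        "Answer with a short phrase, using the object names above where possible.\n"
        f"Question: {question.strip()}"
    )


# Example with a made-up scene vocabulary.
prompt = build_grounded_prompt(
    "What is next to the sofa?", ["Sofa", "lamp", "sofa ", "table"]
)
print(prompt)
```

Restricting the model toward the scene's own object names is one plausible way to bring free-form answers closer to the closed answer vocabulary that these benchmarks score against.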
Abstract
As interest in "reformulating" the 3D Visual Question Answering (VQA) problem in the context of foundation models grows, it is imperative to assess how these new paradigms influence existing closed-vocabulary datasets. In this case study, we evaluate the zero-shot performance of foundational models (GPT-4 Vision and GPT-4) on well-established 3D VQA benchmarks, namely 3D-VQA and ScanQA. We provide an investigation to contextualize the performance of GPT-based agents relative to traditional modeling approaches. We find that GPT-based agents without any fine-tuning perform on par with the closed vocabulary approaches. Our findings corroborate recent results that "blind" models establish a surprisingly strong baseline in closed-vocabulary settings. We demonstrate that agents benefit significantly from scene-specific vocabulary via in-context textual grounding. By presenting a preliminary…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
