Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks

Simranjit Singh; Georgios Pavlakos; Dimitrios Stamoulis

arXiv:2405.18831 · cs.CV · May 30, 2024 · 1 citation

Open Access

TL;DR

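This paper assesses the zero-shot performance of GPT-4V and GPT-4 on two 3D visual question answering benchmarks, 3D-VQA and ScanQA. It finds that GPT-based agents, without any fine-tuning, perform on par with traditional closed-vocabulary approaches and benefit significantly from scene-specific vocabulary supplied through in-context textual grounding.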

Contribution

It provides the first evaluation of GPT-4V on 3D VQA benchmarks and highlights the importance of scene-specific vocabulary in zero-shot settings.

Findings

01. GPT-4V matches traditional closed-vocabulary approaches without fine-tuning

02. Scene-specific vocabulary improves GPT-4V performance

Abstract

As interest in "reformulating" the 3D Visual Question Answering (VQA) problem in the context of foundation models grows, it is imperative to assess how these new paradigms influence existing closed-vocabulary datasets. In this case study, we evaluate the zero-shot performance of foundational models (GPT-4 Vision and GPT-4) on well-established 3D VQA benchmarks, namely 3D-VQA and ScanQA. We provide an investigation to contextualize the performance of GPT-based agents relative to traditional modeling approaches. We find that GPT-based agents without any fine-tuning perform on par with the closed vocabulary approaches. Our findings corroborate recent results that "blind" models establish a surprisingly strong baseline in closed-vocabulary settings. We demonstrate that agents benefit significantly from scene-specific vocabulary via in-context textual grounding. By presenting a preliminary…
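The abstract's "scene-specific vocabulary via in-context textual grounding" can be pictured as prepending a list of scene terms to the question before it is sent to the model. The sketch below is illustrative only: the function name `build_grounded_prompt`, the prompt wording, and the normalization steps are assumptions made for this example and are not taken from the paper, which may construct its prompts differently.

```python
from typing import Iterable


def build_grounded_prompt(question: str, scene_vocabulary: Iterable[str]) -> str:
    """Prepend a scene-specific vocabulary to a 3D VQA question.

    Hypothetical helper: the prompt template is an assumption,
    not the wording used by the paper's authors.
    """
    # Normalize and de-duplicate the terms, preserving first-seen order.
    seen = set()
    terms = []
    for term in scene_vocabulary:
        cleaned = term.strip().lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            terms.append(cleaned)

    vocabulary_line = ", ".join(terms) if terms else "(none provided)"
    return (
        "You are answering a question about a 3D indoor scene.\n"
        f"Terms known to appear in this scene: {vocabulary_line}\n"
        "Prefer short answers that use these terms where possible.\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )
```

Calling `build_grounded_prompt("What is next to the sofa?", ["Sofa", "coffee table", "lamp"])` returns a text prompt that lists the three terms ahead of the question. Steering the model toward the scene's own terms is one plausible reason such grounding would help when answers are scored against a closed vocabulary.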

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications