Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

Simon Ging; María A. Bravo; Thomas Brox

arXiv:2402.07270 · cs.CV · May 7, 2024 · 2 cites

TL;DR

This paper introduces a new VQA benchmarking approach using classification datasets and semantic hierarchies, enabling more detailed evaluation of vision-language models' capabilities and comparison with discriminative models.

Contribution

It proposes a novel VQA benchmark leveraging classification datasets and semantic hierarchies, along with evaluation metrics informed by human judgment.

Findings

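01 The benchmark supports granular evaluation of vision-language models on object, action, and attribute classification.

02 Follow-up questions derived from the semantic hierarchy of the label space improve the assessment of coarse answers on fine-grained classification tasks.

03 Traditional NLP metrics and LLM-based metrics are compared for evaluating model predictions against ground-truth answers.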

Abstract

The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We…
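
The follow-up idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy hierarchy `PARENT`, the question template, and the function names `ancestors`, `follow_up_question`, and `ask_with_follow_up` are all assumptions made for illustration. The sketch assumes that when a model's answer matches a coarser ancestor of the ground-truth label, a more specific question is generated from that ancestor and the model is asked again.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy hierarchy (child -> parent); a real label space
# would supply its own semantic hierarchy.
PARENT: Dict[str, str] = {
    "golden retriever": "retriever",
    "retriever": "dog",
    "dog": "animal",
}


def ancestors(label: str, parent: Dict[str, str]) -> List[str]:
    """Return all ancestors of `label`, nearest first."""
    chain: List[str] = []
    while label in parent:
        label = parent[label]
        chain.append(label)
    return chain


def follow_up_question(
    answer: str, ground_truth: str, parent: Dict[str, str]
) -> Optional[str]:
    """Return a follow-up question if `answer` is a coarser ancestor of
    `ground_truth`; return None if it is exact or unrelated."""
    answer = answer.strip().lower()
    if answer == ground_truth:
        return None
    if answer in ancestors(ground_truth, parent):
        # Assumed question template, chosen only for this sketch.
        return f"What type of {answer} is this?"
    return None


def ask_with_follow_up(
    model: Callable[[str], str],
    ground_truth: str,
    parent: Dict[str, str],
    first_question: str = "What is shown in the image?",
) -> str:
    """Ask once, then ask a follow-up if the first answer was too coarse."""
    answer = model(first_question)
    question = follow_up_question(answer, ground_truth, parent)
    if question is not None:
        answer = model(question)
    return answer
```

Here `model` stands in for any callable that maps a question string to an answer string. In practice the image would be passed alongside the question, and matching an answer to a hierarchy node would likely need something more tolerant than exact string comparison.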

Code & Models

Repositories

lmb-freiburg/ovqa (PyTorch, official)

Videos

Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications