Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy
Simon Ging, María A. Bravo, Thomas Brox

TL;DR
This paper introduces a new VQA benchmarking approach using classification datasets and semantic hierarchies, enabling more detailed evaluation of vision-language models' capabilities and comparison with discriminative models.
Contribution
It proposes a novel VQA benchmark leveraging classification datasets and semantic hierarchies, along with evaluation metrics informed by human judgment.
Findings
Granular evaluation of vision-language models on object, action, and attribute classification.
Semantic hierarchies improve assessment of coarse answers in fine-grained tasks.
Comparison of NLP and LLM-based metrics for model evaluation.
Abstract
The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We…
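The follow-up idea described in the abstract can be illustrated with a minimal sketch. It assumes a toy child-to-parent hierarchy and a hypothetical question template; the names `PARENT`, `ancestors`, and `follow_up_question` are illustrative and are not taken from the paper or its code. When a model's answer is a coarser ancestor of the ground-truth label, the sketch generates a finer-grained question instead of marking the answer wrong.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy (child -> parent); not the paper's label space.
PARENT: Dict[str, str] = {
    "golden retriever": "dog",
    "dog": "animal",
    "tabby cat": "cat",
    "cat": "animal",
}


def ancestors(label: str) -> List[str]:
    """Return the ancestors of `label`, nearest parent first."""
    chain: List[str] = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain


def follow_up_question(answer: str, ground_truth: str) -> Optional[str]:
    """Return a follow-up question if `answer` is a coarse ancestor of the label.

    The question template is an assumption made for illustration.
    """
    normalized = answer.strip().lower()
    if normalized == ground_truth:
        return None  # already correct at the finest level
    if normalized in ancestors(ground_truth):
        return f"What type of {normalized} is this?"
    return None  # answer is not on the path to the ground truth


if __name__ == "__main__":
    print(follow_up_question("Dog", "golden retriever"))
```

In this sketch an unrelated answer (for example "cat" for a golden retriever) yields no follow-up, so only answers that are correct but too coarse get a second chance.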
Taxonomy
Topics: Multimodal Machine Learning Applications
