Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Neelabh Sinha, Vinija Jain, and Aman Chadha

TL;DR
This paper introduces VQA360, a comprehensive dataset, and GoEval, an evaluation metric, for assessing vision-language models on visual question-answering across diverse tasks and domains, highlighting the importance of model selection.
Contribution
The paper presents VQA360 and GoEval, enabling standardized, comprehensive evaluation of VLMs for VQA, and analyzes model performance across various settings to guide model choice.
Findings
Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others.
Open-source models such as InternVL-2-8B show competitive strengths.
No single model excels universally across all tasks and domains.
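Because no single model wins everywhere, model choice can be framed as a per-category lookup. The sketch below is illustrative only and is not the authors' code: the model names, tags, and scores are hypothetical placeholders, and it assumes each evaluated answer already carries a numeric score in [0, 1] and a category tag (a task type, domain, or knowledge type).

```python
from collections import defaultdict

# Hypothetical records: (model, tag, score in [0, 1]).
# These are placeholder values, not results from the paper.
records = [
    ("model_a", "chart", 0.80),
    ("model_b", "chart", 0.60),
    ("model_a", "document", 0.50),
    ("model_b", "document", 0.70),
    ("model_a", "chart", 0.70),
    ("model_b", "document", 0.90),
]


def mean_score_by_tag(records):
    """Return {tag: {model: mean score}} from (model, tag, score) records."""
    sums = defaultdict(lambda: [0.0, 0])  # (tag, model) -> [total, count]
    for model, tag, score in records:
        acc = sums[(tag, model)]
        acc[0] += score
        acc[1] += 1
    table = defaultdict(dict)
    for (tag, model), (total, count) in sums.items():
        table[tag][model] = total / count
    return table


def best_model_per_tag(records):
    """Pick the model with the highest mean score for each tag."""
    table = mean_score_by_tag(records)
    return {tag: max(scores, key=scores.get) for tag, scores in table.items()}


best = best_model_per_tag(records)
print(best)
```

With these placeholder scores, a different model is selected for each tag, which is the situation the findings describe: the best choice depends on the category being served.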
Abstract
Visual Question-Answering (VQA) has become key to user experience, particularly after improved generalization capabilities of Vision-Language Models (VLMs). But evaluating VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper aims to solve that using an end-to-end framework. We present VQA360, a novel dataset derived from established VQA benchmarks and annotated with task types, application domains, and knowledge types, for a comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, making the right choice a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but…
Taxonomy
Topics: Multimodal Machine Learning Applications
