Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Neelabh Sinha, Vinija Jain, and Aman Chadha

arXiv:2409.09269·cs.CV·December 13, 2024

TL;DR

This paper introduces VQA360, a comprehensive dataset, and GoEval, an evaluation metric, for assessing vision-language models on visual question-answering across diverse tasks and domains, and it highlights the importance of model selection.

Contribution

The paper presents VQA360 and GoEval, enabling standardized, comprehensive evaluation of VLMs for VQA, and analyzes model performance across various settings to guide model choice.

Findings

01. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others.

02. Open-source models such as InternVL-2-8B show competitive strengths.

03. No single model excels universally across all tasks and domains.

Abstract

Visual Question-Answering (VQA) has become a key part of user experience, particularly as the generalization capabilities of Vision-Language Models (VLMs) have improved. However, evaluating VLMs against an application's requirements with a standardized framework in practical settings remains challenging. This paper aims to address that with an end-to-end framework. We present VQA360, a novel dataset derived from established VQA benchmarks and annotated with task types, application domains, and knowledge types, for comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, which achieves a correlation factor of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, making the right choice a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but…
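Because the abstract frames model choice as a per-setting decision, one way to picture it is as a lookup of the best-scoring model for each annotation tag. The following is only a minimal sketch under stated assumptions: the record layout, the function names `mean_score_by_tag` and `best_model_by_tag`, the tags, the model names, and the scores are all hypothetical placeholders, not the paper's data, code, or results.

```python
from collections import defaultdict

# Hypothetical per-example records: (model, tag, score in [0, 1]).
# Model names, tags, and scores are placeholders, NOT results from the paper.
records = [
    ("model_a", "chart", 1.0),
    ("model_a", "chart", 0.0),
    ("model_a", "document", 1.0),
    ("model_b", "chart", 1.0),
    ("model_b", "chart", 1.0),
    ("model_b", "document", 0.0),
]


def mean_score_by_tag(records):
    """Return {tag: {model: mean score}} from (model, tag, score) records."""
    # totals[tag][model] holds [running sum, example count].
    totals = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for model, tag, score in records:
        cell = totals[tag][model]
        cell[0] += score
        cell[1] += 1
    return {
        tag: {model: total / count for model, (total, count) in models.items()}
        for tag, models in totals.items()
    }


def best_model_by_tag(records):
    """Pick, for each tag, the model with the highest mean score."""
    means = mean_score_by_tag(records)
    return {tag: max(scores, key=scores.get) for tag, scores in means.items()}


best = best_model_by_tag(records)
print(best)
```

With these toy records, different tags end up with different winners, which is the situation the abstract describes when it says no single model excels universally. The same grouping could be keyed on task type, application domain, or knowledge type.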

Code & Models

Repositories

neelabhsinha/vlm-selection-tasks-domains-knowledge-type
PyTorch · Official

Taxonomy

Topics

Multimodal Machine Learning Applications