Evaluating Variance in Visual Question Answering Benchmarks
Nikitha SR

TL;DR
This paper investigates the substantial performance variance on visual question answering benchmarks caused by stochastic model outputs, training seeds, and hyperparameter choices, and proposes variance-aware evaluation methods for more reliable assessment of multimodal models.
Contribution
The paper systematically analyzes sources of variance in VQA benchmarks and evaluates alternative assessment strategies, such as Cloze-style evaluation, to improve reliability.
Findings
Performance variance is substantial across benchmarks due to stochastic factors.
Extended instruction finetuning influences model performance variability.
Cloze-style evaluation reduces stochasticity and enhances assessment reliability.
Abstract
Multimodal large language models (MLLMs) have emerged as powerful tools for visual question answering (VQA), enabling reasoning and contextual understanding across visual and textual modalities. Despite their advancements, the evaluation of MLLMs on VQA benchmarks often relies on point estimates, overlooking the significant variance in performance caused by factors such as stochastic model outputs, training seed sensitivity, and hyperparameter configurations. This paper critically examines these issues by analyzing variance across 14 widely used VQA benchmarks, covering diverse tasks such as visual reasoning, text understanding, and commonsense reasoning. We systematically study the impact of training seed, framework non-determinism, model scale, and extended instruction finetuning on performance variability. Additionally, we explore Cloze-style evaluation as an alternate assessment…
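To illustrate the gap between a point estimate and a variance-aware report, the sketch below summarizes accuracy over several training seeds as a mean, a sample standard deviation, and a range. It is a minimal illustration, not the paper's procedure: the function name `summarize_runs` and the accuracy values are hypothetical and chosen only for demonstration.

```python
import statistics


def summarize_runs(scores):
    """Summarize per-seed benchmark scores instead of reporting a single run.

    `scores` is a list of accuracies, one per training seed (hypothetical input).
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation (n - 1)
        "min": min(scores),
        "max": max(scores),
        "n": len(scores),
    }


# Hypothetical accuracies (%) from five seeds per model; not taken from the paper.
model_a = [61.2, 62.0, 60.5, 61.8, 61.0]
model_b = [61.9, 60.8, 62.4, 61.1, 61.6]

for name, runs in (("model_a", model_a), ("model_b", model_b)):
    s = summarize_runs(runs)
    print(
        f"{name}: {s['mean']:.2f} +/- {s['std']:.2f} "
        f"(range {s['min']:.1f} to {s['max']:.1f}, n={s['n']})"
    )
```

Reporting the spread alongside the mean makes it visible when the gap between two models is smaller than the run-to-run noise, which a single-run comparison would hide.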
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
