Unexplored flaws in multiple-choice VQA evaluations

Fabio Rosenthal; Sebastian Schmidt; Thorsten Graf; Thorsten Bagodonat; Stephan Günnemann; Leo Schwinn

arXiv:2511.22341 · cs.CV · December 1, 2025

TL;DR

This paper uncovers previously unexamined biases in prompt formatting that significantly affect the reliability of multiple-choice VQA evaluations across various models and datasets, revealing a need for improved assessment methods.

Contribution

The study identifies three key prompt formatting biases in multiple-choice VQA and demonstrates their impact across seven models and five datasets, highlighting limitations of current bias mitigation strategies.

Findings

01. Prompt formatting biases significantly influence VQA performance.

02. Biases persist regardless of answer order or model confidence.

03. Existing mitigation methods do not address these new biases.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are…
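
As a rough illustration of what a grid of prompt format variations can look like, the sketch below crosses a few formatting factors and measures the accuracy gap between the best and worst variant. The factors used here (label style, label wrapper, option separator), their values, and the helper names build_prompts and accuracy_spread are all hypothetical placeholders; they are not taken from the paper, whose three factors are not listed in this summary.

```python
from itertools import product

# Hypothetical formatting factors, chosen only for illustration.
# They are NOT the three factors studied in the paper.
LABEL_STYLES = {
    "upper": ["A", "B", "C", "D"],
    "lower": ["a", "b", "c", "d"],
    "digit": ["1", "2", "3", "4"],
}
LABEL_WRAPPERS = ["{}.", "({})", "{}:"]
OPTION_SEPARATORS = ["\n", " "]


def build_prompts(question, choices):
    """Return one prompt per combination of the formatting factors."""
    prompts = {}
    grid = product(LABEL_STYLES.items(), LABEL_WRAPPERS, OPTION_SEPARATORS)
    for (style, labels), wrapper, sep in grid:
        options = sep.join(
            f"{wrapper.format(label)} {choice}"
            for label, choice in zip(labels, choices)
        )
        prompts[(style, wrapper, sep)] = f"{question}\n{options}"
    return prompts


def accuracy_spread(accuracy_by_format):
    """Gap between the best and worst accuracy across format variants."""
    values = list(accuracy_by_format.values())
    return max(values) - min(values)


prompts = build_prompts(
    "What color is the car?", ["red", "blue", "green", "black"]
)
print(len(prompts))  # 3 * 3 * 2 = 18 variants in this toy grid
```

In such a setup, the same question and answer order would be sent to a model once per variant, and a large spread would indicate sensitivity to formatting alone.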

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning