VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

Mozhgan Nasr Azadani; Yimu Wang; Yongpeng Zhu; Lihong Chen; Milan Ganai; Sean Sedwards; Marco Pavone; Krzysztof Czarnecki

arXiv:2605.20676·cs.CV·May 21, 2026

TL;DR

VISTAQA is a new benchmark that evaluates multimodal models on both answer correctness and pixel-level evidence grounding, emphasizing transparency and reliability in visual question answering.

Contribution

The paper introduces VISTAQA, a comprehensive dataset, and GROVE, a joint evaluation metric for assessing reasoning and grounding in multimodal models.

Findings

1. Strong models show limited performance under joint evaluation.

2. Current systems often fail to align answers with visual evidence.

3. VISTAQA reveals a significant gap between answer accuracy and grounding quality.

Abstract

Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their…
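To make the idea of joint evaluation concrete, the following is a minimal sketch of one way a per-sample score could couple answer correctness with mask overlap. It is not the GROVE metric, whose definition is not given in this summary. The function names `mask_iou` and `joint_score`, the use of intersection-over-union, and the 0.5 threshold are all illustrative assumptions.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (assumed representation for this sketch).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, gold: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gold_row in zip(pred, gold):
        for p, g in zip(pred_row, gold_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def joint_score(answer_correct: bool, pred: Mask, gold: Mask,
                iou_threshold: float = 0.5) -> float:
    """Hypothetical joint criterion (not GROVE): a sample earns credit only
    when the answer is correct AND the predicted mask overlaps the reference
    evidence mask by at least `iou_threshold`."""
    grounded = mask_iou(pred, gold) >= iou_threshold
    return 1.0 if (answer_correct and grounded) else 0.0


if __name__ == "__main__":
    gold = [[1, 0], [0, 0]]
    good_mask = [[1, 1], [0, 0]]   # overlaps the evidence region
    bad_mask = [[0, 0], [0, 1]]    # points at an unrelated region
    print(joint_score(True, good_mask, gold))
    print(joint_score(True, bad_mask, gold))
```

Under a criterion of this kind, a model that answers correctly while segmenting the wrong region receives no credit, which is the sort of gap between answer accuracy and grounding quality that the benchmark is designed to expose.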
