VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence
Mozhgan Nasr Azadani, Yimu Wang, Yongpeng Zhu, Lihong Chen, Milan Ganai, Sean Sedwards, Marco Pavone, Krzysztof Czarnecki

TL;DR
VISTAQA is a new benchmark that evaluates multimodal models on both answer correctness and pixel-level evidence grounding, emphasizing transparency and reliability in visual question answering.
Contribution
The paper introduces VISTAQA, a comprehensive dataset, and GROVE, a joint evaluation metric for assessing reasoning and grounding in multimodal models.
Findings
Strong models show limited performance under joint evaluation.
Current systems often fail to align answers with visual evidence.
VISTAQA reveals a significant gap between answer accuracy and grounding quality.
Abstract
Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their…
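To make the idea of joint evaluation concrete, here is a minimal sketch of one way to couple answer correctness with mask quality. It is not the GROVE metric, whose definition is not given in this summary. The function names `mask_iou` and `joint_score`, the hard gating on answer correctness, and the choice of intersection-over-union are all illustrative assumptions.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask: rows of 0/1 values


def mask_iou(pred: Mask, gold: Mask) -> float:
    """Intersection-over-union of two binary masks of identical shape."""
    inter = union = 0
    for pred_row, gold_row in zip(pred, gold):
        for p, g in zip(pred_row, gold_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks agree perfectly, so return 1.0.
    return inter / union if union else 1.0


def joint_score(answer_correct: bool, pred: Mask, gold: Mask) -> float:
    """Hypothetical joint score (not GROVE): IoU, zeroed if the answer is wrong."""
    return mask_iou(pred, gold) if answer_correct else 0.0


# Toy 2x3 example: the prediction misses one of four evidence pixels.
gold = [[0, 1, 1],
        [0, 1, 1]]
pred = [[0, 0, 1],
        [0, 1, 1]]

print(joint_score(True, pred, gold))   # 0.75
print(joint_score(False, pred, gold))  # 0.0
```

Under this toy scoring, a model gets no credit for a well-placed mask if its answer is wrong, and only partial credit for a correct answer with imprecise evidence. A benchmark's actual metric may weigh the two components quite differently.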
