Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems
Nishanth Madhusudhan, Vikas Yadav, Alexandre Lacoste

TL;DR
This paper introduces MM-AQA, a benchmark for evaluating when multimodal systems should abstain from answering, revealing that current models rarely abstain and that training for abstention is essential.
Contribution
The paper presents a new benchmark for assessing abstention in multimodal reasoning systems and analyzes the limitations of current models in recognizing unanswerable instances.
Findings
Models rarely abstain under standard prompting.
Multi-agent systems (MAS) improve abstention but introduce an accuracy-abstention trade-off.
Sequential designs match or outperform iterative ones, indicating calibration issues.
Abstract
Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this…
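To make the idea of a confidence baseline and the accuracy-abstention trade-off concrete, here is a minimal sketch of threshold-based abstention. It is not the paper's implementation: the `Sample` fields, the `decide` and `evaluate` helpers, the metric names, and the use of a single self-reported confidence score are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    prediction: str        # the model's candidate answer
    confidence: float      # hypothetical confidence score in [0, 1]
    gold: Optional[str]    # reference answer; None marks an unanswerable instance


def decide(sample: Sample, threshold: float) -> Optional[str]:
    """Return the prediction, or None (abstain) when confidence is below the threshold."""
    return sample.prediction if sample.confidence >= threshold else None


def evaluate(samples: List[Sample], threshold: float) -> Dict[str, float]:
    """Score answerable and unanswerable instances separately (illustrative metrics)."""
    answerable = [s for s in samples if s.gold is not None]
    unanswerable = [s for s in samples if s.gold is None]
    # On answerable instances, an abstention counts as an error.
    correct = sum(decide(s, threshold) == s.gold for s in answerable)
    # On unanswerable instances, only abstaining counts as success.
    abstained = sum(decide(s, threshold) is None for s in unanswerable)
    return {
        "answerable_accuracy": correct / len(answerable) if answerable else 0.0,
        "abstention_recall": abstained / len(unanswerable) if unanswerable else 0.0,
    }


if __name__ == "__main__":
    # Toy data, invented for this example only.
    toy = [
        Sample("cat", 0.9, "cat"),
        Sample("dog", 0.4, "dog"),
        Sample("red", 0.3, None),
        Sample("blue", 0.8, None),
    ]
    for t in (0.0, 0.5, 0.95):
        print(t, evaluate(toy, t))
```

Raising the threshold increases abstention on unanswerable instances but also suppresses correct answers on answerable ones, which is the kind of trade-off the findings describe.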
