Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective
Boqi Chen, Xudong Liu, Yunke Ao, Jianing Qiu

TL;DR
This paper argues that greedy decoding outperforms stochastic sampling in Visual Question Answering tasks, supported by theoretical analysis and extensive experiments, challenging common heuristics inherited from large language models.
Contribution
It provides a formal relationship between calibration and accuracy, derives conditions for greedy decoding optimality, and introduces a new decoding method for multimodal reasoning.
Findings
Greedy decoding outperforms stochastic sampling across multiple VQA benchmarks.
Theoretical conditions for greedy decoding optimality are established.
A new decoding method surpasses existing heuristics in multimodal reasoning.
Abstract
Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for…
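The link between calibration and decoding accuracy can be illustrated with a small sketch. This is not the paper's formalization or its proposed method. It is a toy single-question setup under stated assumptions: the model assigns a distribution over a closed set of candidate answers, and the model is perfectly calibrated, so the chance that an answer is correct equals the probability the model gives it. The logits and the helper names `softmax`, `greedy_accuracy`, and `sampling_accuracy` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with optional temperature scaling."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_accuracy(p_model, p_true):
    """Expected accuracy when always emitting the model's top answer."""
    top = max(range(len(p_model)), key=lambda i: p_model[i])
    return p_true[top]


def sampling_accuracy(p_model, p_true):
    """Expected accuracy when answer i is emitted with probability p_model[i]."""
    return sum(q * t for q, t in zip(p_model, p_true))


# Hypothetical head-heavy answer distribution for one question.
logits = [3.0, 1.0, 0.5, 0.0]
p = softmax(logits)

# Calibration assumption: the true correctness probabilities equal p.
acc_greedy = greedy_accuracy(p, p)      # equals max(p)
acc_sample = sampling_accuracy(p, p)    # equals sum of p_i squared

print(f"greedy:   {acc_greedy:.3f}")
print(f"sampling: {acc_sample:.3f}")
```

Under this assumption, sampling at temperature 1 scores the sum of squared probabilities, which can never exceed the largest probability, so greedy decoding is at least as accurate on every question. Lowering the temperature moves the sampling score toward the greedy one. When the model is miscalibrated, this guarantee need not hold, which is why conditions on calibration matter.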
