Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

Boqi Chen; Xudong Liu; Yunke Ao; Jianing Qiu

arXiv:2604.23443·cs.CL·April 28, 2026

TL;DR

This paper argues that greedy decoding outperforms stochastic sampling on Visual Question Answering (VQA), backing the claim with theoretical analysis and extensive experiments. The result challenges sampling heuristics that multimodal models commonly inherit from large language models.

Contribution

The paper formalizes the relationship between model calibration and predictive accuracy, derives sufficient conditions under which greedy decoding is optimal, and proposes a new decoding method for multimodal reasoning.
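
As a rough illustration of why calibration and greedy accuracy are linked, consider a simplified single-answer setting. This is not the paper's derivation. It assumes a perfectly calibrated model whose probability p(a) for each candidate answer a equals the chance that a is correct, and it compares greedy decoding with sampling at temperature 1. The symbols Acc_greedy and Acc_sample are introduced here for illustration only.

```latex
\mathrm{Acc}_{\text{greedy}} = \max_{a} p(a),
\qquad
\mathrm{Acc}_{\text{sample}} = \sum_{a} p(a)^{2}
\;\le\; \max_{a} p(a) \sum_{a} p(a)
\;=\; \mathrm{Acc}_{\text{greedy}}
```

Under these assumptions, sampling can only match greedy decoding when all probability mass sits on answers tied for the maximum. The sufficient conditions actually stated in the paper may differ from this toy case.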

Findings

01

Greedy decoding outperforms stochastic sampling on VQA benchmarks.

02

Sufficient conditions for greedy decoding optimality are derived.

03

A new decoding method surpasses existing heuristics in multimodal reasoning.

Abstract

Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for…
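
The gap between greedy decoding and sampling on a head-heavy answer distribution can be checked numerically with a minimal sketch. It is not the authors' code or experimental setup. It assumes a single-step answer, a hypothetical four-answer distribution `p`, and perfect calibration, meaning `p[i]` is taken to be the probability that answer `i` is correct. The function names `greedy_accuracy` and `sampling_accuracy` are illustrative.

```python
import numpy as np


def greedy_accuracy(p):
    """Expected accuracy when always picking the most probable answer.

    Assumes p[i] is the probability that answer i is correct
    (perfect calibration, an assumption of this sketch).
    """
    return float(np.max(p))


def sampling_accuracy(p, temperature=1.0):
    """Expected accuracy when sampling from p after temperature scaling."""
    # Temperature-scaled sampling distribution q, proportional to p**(1/T).
    logits = np.log(p) / temperature
    q = np.exp(logits - logits.max())
    q /= q.sum()
    # Answer i is drawn with probability q[i] and is correct with probability p[i].
    return float(np.dot(q, p))


# Hypothetical head-heavy distribution over four candidate answers.
p = np.array([0.7, 0.15, 0.1, 0.05])

if __name__ == "__main__":
    print("greedy:", round(greedy_accuracy(p), 4))
    for t in (1.0, 0.5, 0.1):
        print(f"sampling T={t}:", round(sampling_accuracy(p, t), 4))
```

In this toy setting, sampling at temperature 1 yields an expected accuracy of about 0.525 against 0.7 for greedy decoding, and lowering the temperature moves sampling toward the greedy value. Real VQA models are not perfectly calibrated and answers span multiple tokens, so this only illustrates the direction of the effect.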
