Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
Ji Young Byun, Young-Jin Park, Jean-Philippe Corbeil, Asma Ben Abacha

TL;DR
This paper systematically investigates confidence calibration in medical vision-language models. It reveals persistent overconfidence, shows that simple calibration methods have limited effectiveness, and proposes hallucination-aware calibration to improve reliability.
Contribution
It provides the first comprehensive empirical analysis of confidence calibration in medical VLMs and introduces hallucination-aware calibration to enhance trustworthiness.
Findings
Overconfidence persists across models and scales.
Post-hoc calibration methods outperform prompting strategies.
Hallucination signals improve calibration and AUROC, especially on open-ended questions.
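To make the calibration terms in these findings concrete, here is a minimal sketch of how expected calibration error (ECE) can be measured and how a Platt-style post-hoc fix can be fitted. It is not the paper's implementation. The function names, the 10-bin ECE, and the choice to fit the sigmoid on the logit of the model's stated confidence are all assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin-weighted mean of |accuracy - confidence| over equal-width bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to a bin index in [0, n_bins - 1].
    idx = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def _logit(conf, eps=1e-6):
    conf = np.clip(np.asarray(conf, dtype=float), eps, 1.0 - eps)
    return np.log(conf / (1.0 - conf))

def fit_platt(conf, correct, lr=0.1, steps=5000):
    """Fit sigmoid(a * logit(conf) + b) to correctness labels (assumed variant).

    Plain gradient descent on the logistic loss; a and b are the two
    Platt parameters, learned on a held-out calibration split.
    """
    z = _logit(conf)
    y = np.asarray(correct, dtype=float)
    a, b = 1.0, 0.0  # identity mapping as the starting point
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * z + b)))
        grad = p - y  # derivative of the logistic loss w.r.t. the logit
        a -= lr * np.mean(grad * z)
        b -= lr * np.mean(grad)
    return a, b

def apply_platt(conf, a, b):
    """Map raw confidences to calibrated ones with fitted (a, b)."""
    return 1.0 / (1.0 + np.exp(-(a * _logit(conf) + b)))
```

In practice the parameters would be fitted on a calibration split and ECE reported on a separate test split. Note that a monotone map like this one can lower ECE without changing how well confidence ranks correct above incorrect answers, so AUROC is unaffected unless an additional signal is introduced.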
Abstract
As vision-language models (VLMs) are increasingly deployed in clinical decision support, more than accuracy is required: knowing when to trust their predictions is equally critical. Yet, a comprehensive and systematic investigation into the overconfidence of these models remains notably scarce in the medical domain. We address this gap through a comprehensive empirical study of confidence calibration in VLMs, spanning three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B to 38B), and multiple confidence estimation prompting strategies, across three medical visual question answering (VQA) benchmarks. Our study yields three key findings: First, overconfidence persists across model families and is not resolved by scaling or prompting, such as chain-of-thought and verbalized confidence variants. Second, simple post-hoc calibration approaches, such as Platt scaling,…
