Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

Ji Young Byun; Young-Jin Park; Jean-Philippe Corbeil; Asma Ben Abacha

arXiv:2604.02543 · cs.CV · April 6, 2026

TL;DR

This paper systematically investigates confidence calibration in medical vision-language models. It reveals persistent overconfidence and the limited effectiveness of simple calibration methods, and it proposes hallucination-aware calibration to improve reliability.

Contribution

The paper provides a comprehensive empirical analysis of confidence calibration in medical VLMs, an area the authors describe as notably under-studied, and introduces hallucination-aware calibration to enhance trustworthiness.

Findings

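01. Overconfidence persists across model families and scales, and is not resolved by prompting strategies such as chain-of-thought or verbalized confidence.

02. Post-hoc calibration methods outperform prompting strategies.

03. Hallucination signals improve calibration and AUROC, especially on open-ended questions.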
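Overconfidence of the kind described in finding 01 is commonly quantified with expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. This summary does not say which metric or binning the authors use, so the following is only a minimal sketch under assumed defaults (10 equal-width bins); the function name `expected_calibration_error` and the toy inputs are illustrative, not taken from the paper.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins.

    Assumption: confidences lie in [0, 1]; `correct` holds 0/1 outcomes.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges; right=True makes each bin right-inclusive,
    # so a confidence of exactly 1.0 falls into the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece


# Toy illustration (made-up data): five answers given at 90% confidence,
# of which only three are correct. Confidence exceeds accuracy, so ECE > 0.
toy_ece = expected_calibration_error([0.9] * 5, [1, 1, 1, 0, 0])
```

A model whose stated confidence matched its accuracy in every bin would score an ECE of zero; a consistently positive gap between confidence and accuracy is what "overconfidence" refers to here.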

Abstract

As vision-language models (VLMs) are increasingly deployed in clinical decision support, more than accuracy is required: knowing when to trust their predictions is equally critical. Yet, a comprehensive and systematic investigation into the overconfidence of these models remains notably scarce in the medical domain. We address this gap through a comprehensive empirical study of confidence calibration in VLMs, spanning three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B–38B), and multiple confidence estimation prompting strategies, across three medical visual question answering (VQA) benchmarks. Our study yields three key findings: First, overconfidence persists across model families and is not resolved by scaling or prompting, such as chain-of-thought and verbalized confidence variants. Second, simple post-hoc calibration approaches, such as Platt scaling,…
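Platt scaling, which the abstract names as an example of a simple post-hoc calibration approach, fits a two-parameter logistic map from a raw score to a calibrated probability using held-out labels. The sketch below is not the authors' implementation: it assumes the raw score is a confidence logit, fits the map with plain gradient descent, and runs on synthetic data generated to be overconfident. The names `fit_platt` and `apply_platt` and every number in the demo are hypothetical.

```python
import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_platt(scores, correct, lr=0.5, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(correct, dtype=float)
    a, b = 1.0, 0.0  # start from the identity map on logits
    for _ in range(steps):
        residual = _sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
        a -= lr * np.mean(residual * s)
        b -= lr * np.mean(residual)
    return a, b


def apply_platt(scores, a, b):
    return _sigmoid(a * np.asarray(scores, dtype=float) + b)


# Synthetic, deliberately overconfident data (assumption for illustration only):
# the model reports sigmoid(logit), but answers are correct with a lower probability.
rng = np.random.default_rng(0)
logits = rng.normal(loc=2.0, scale=1.0, size=2000)
raw_conf = _sigmoid(logits)
correct = (rng.random(2000) < _sigmoid(0.5 * logits - 0.5)).astype(float)

a, b = fit_platt(logits, correct)
calibrated = apply_platt(logits, a, b)

# Gap between average confidence and accuracy, before and after calibration.
gap_before = abs(raw_conf.mean() - correct.mean())
gap_after = abs(calibrated.mean() - correct.mean())
```

In practice the parameters would be fit on a held-out calibration split rather than on the evaluation data. Note that when the fitted slope `a` is positive the map is monotone, so it rescales confidences without reordering them; ranking metrics such as AUROC are therefore unchanged by this kind of calibration.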
