Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models
Binesh Sadanandan, Vahid Behzadan

TL;DR
This paper investigates how medical vision-language models' confidence calibration and sensitivity to question rephrasing are linked to their proximity to decision boundaries, proposing entropy-based measures for reliability.
Contribution
It demonstrates that predictive entropy from a single forward pass can effectively identify unreliable and rephrase-sensitive predictions in medical VLMs, with simple methods outperforming complex ensembles.
Findings
Predictive entropy predicts rephrasing failures with AUROC 0.711 on MedGemma.
Single-pass entropy outperforms ensembles at error detection (AUROC 0.743).
MC Dropout achieves the best calibration (ECE 4.3) and 21.5% coverage at 5% risk.
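For readers unfamiliar with the two metrics quoted in the findings, the following is a minimal sketch of how expected calibration error (ECE) and coverage at a fixed risk level are commonly computed. It is not taken from the paper: the function names, the choice of 10 equal-width bins, and the toy inputs are assumptions made for illustration, and the authors' exact protocol may differ.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.

    n_bins=10 with equal-width bins is an assumption, not the paper's setting.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    idx = np.digitize(conf, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

def coverage_at_risk(uncertainty, correct, max_risk=0.05):
    """Largest fraction of samples that can be answered, most confident first,
    while the error rate among the answered samples stays <= max_risk."""
    order = np.argsort(uncertainty, kind="stable")
    errors = 1.0 - np.asarray(correct, dtype=float)[order]
    risk = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    ok = np.nonzero(risk <= max_risk)[0]
    return 0.0 if ok.size == 0 else (ok[-1] + 1) / len(errors)

# Toy data (hypothetical): four predictions, the least certain one is wrong.
conf = [0.85, 0.85, 0.85, 0.85]
correct = [1, 1, 1, 0]
uncertainty = [0.1, 0.2, 0.3, 0.4]
ece = expected_calibration_error(conf, correct)
cov = coverage_at_risk(uncertainty, correct, max_risk=0.05)
```

Under this reading, "21.5% coverage at 5% risk" means the model can answer 21.5% of the cases while keeping its error rate on those cases at or below 5%, abstaining on the rest.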
Abstract
Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma-4B-IT across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets, with cross-architecture validation on LLaVA-RAD-7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD, p < 10^-4), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC→PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-RAD's ensemble does not collapse (69.1%). MC Dropout achieves the best…
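The central idea in the abstract, using the entropy of a single forward pass as a flag for rephrase-sensitive predictions, can be illustrated with a short sketch. It assumes a binary (yes/no) question, so that one probability `p_yes` describes the model's answer distribution. The threshold of 0.8 bits, the helper names, and the toy numbers are hypothetical and do not come from the paper.

```python
import numpy as np

def binary_entropy(p_yes):
    """Predictive entropy in bits of a yes/no answer distribution."""
    p = np.clip(np.asarray(p_yes, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def flag_uncertain(p_yes, threshold=0.8):
    """Flag predictions whose entropy meets a threshold (0.8 is a made-up value)."""
    return binary_entropy(p_yes) >= threshold

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative,
    counting ties as one half (pairwise definition of AUROC)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

# Toy data (hypothetical): p_yes from one forward pass per sample, and whether
# the answer flipped when the question was rephrased.
p_yes = [0.52, 0.97, 0.40, 0.05]
flipped = [1, 0, 1, 0]

entropy = binary_entropy(p_yes)
flags = flag_uncertain(p_yes)
flip_auroc = auroc(entropy, flipped)
```

Entropy peaks at `p_yes = 0.5`, which is the decision boundary for a binary answer, so high entropy is a direct proxy for the proximity to the boundary that the abstract identifies as the shared cause of both failure modes.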
