Calibrated Confidence Expression for Radiology Report Generation
David Bani-Harouni, Chantal Pellegrini, Julian L\"uers, Su Hwan Kim, Markus Baalmann, Benedikt Wiestler, Rickmer Braren, Nassir Navab, Matthias Keicher

TL;DR
This paper presents ConRad, a reinforcement learning framework that calibrates confidence estimates in radiology report generation by medical vision-language models, enhancing safety and interpretability.
Contribution
It introduces a novel RL-based method for producing calibrated verbalized confidence in radiology reports, improving over existing models and aligning with clinical judgment.
Findings
ConRad significantly improves calibration of confidence estimates.
Report-level scores align well with clinicians' judgments.
Supports safer AI-assisted report generation through targeted review.
Abstract
Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
