Mind the Gap: Benchmarking LLM Uncertainty and Calibration with Specialty-Aware Clinical QA and Reasoning-Based Behavioural Features
Alberto Testoni, Iacer Calixto

TL;DR
This paper benchmarks uncertainty estimation methods for clinical question answering across multiple specialties, question types, and models, revealing that uncertainty reliability varies with context and emphasizing the importance of model selection and ensemble strategies.
Contribution
It introduces a comprehensive evaluation of uncertainty quantification in clinical LLMs, including a novel lightweight behavioral feature-based method and analysis of conformal prediction.
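For readers unfamiliar with the set-based approach mentioned above, the following is a minimal sketch of standard split conformal prediction applied to multiple-choice QA. It is not the paper's implementation: the nonconformity score (one minus the probability assigned to the true option), the function names `conformal_threshold` and `prediction_set`, and the toy numbers are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    cal_scores are nonconformity scores on held-out calibration questions,
    assumed here to be 1 - p(true option). This choice is illustrative.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every option.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(option_probs: Dict[str, float], threshold: float) -> List[str]:
    """Keep every option whose score (1 - probability) is within the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= threshold]


# Hypothetical calibration scores for nine questions.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(cal_scores, alpha=0.2)

# Hypothetical option probabilities for one new question.
answer_set = prediction_set({"A": 0.7, "B": 0.25, "C": 0.05}, threshold)
```

Under this scheme a larger answer set signals higher uncertainty, which is why the set size can complement a single confidence score.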
Findings
Uncertainty reliability varies by clinical specialty and question type.
Model ensemble strategies can leverage complementary strengths.
Behavioral features improve uncertainty estimation in reasoning models.
Abstract
Reliable uncertainty quantification (UQ) is essential when employing large language models (LLMs) in high-risk domains such as clinical question answering (QA). In this work, we evaluate uncertainty estimation methods for clinical QA focusing, for the first time, on eleven clinical specialties and six question types, and across ten open-source LLMs (general-purpose, biomedical, and reasoning models), alongside representative proprietary models. We analyze score-based UQ methods, present a case study introducing a novel lightweight method based on behavioral features derived from reasoning-oriented models, and examine conformal prediction as a complementary set-based approach. Our findings reveal that uncertainty reliability is not a monolithic property, but one that depends on clinical specialty and question type due to shifts in calibration and discrimination. Our results highlight the…
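The abstract distinguishes calibration from discrimination. As a rough illustration of why the two can diverge, the sketch below computes an equal-width-bin expected calibration error (ECE) and a pairwise AUROC from per-question confidence scores and correctness labels. The metric choices, function names, and toy data are assumptions made for illustration; the listing does not state which metrics or binning the authors use.

```python
from typing import List, Sequence


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(conf):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    total = len(conf)
    ece = 0.0
    for idx in bins:
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(conf[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(acc - avg_conf)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a correct answer outranks an incorrect one (ties count 0.5)."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy case 1: confidence ranks answers perfectly but is not well calibrated.
ranked_conf, ranked_correct = [0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]

# Toy case 2: constant 0.5 confidence at 50% accuracy is perfectly calibrated
# yet cannot separate correct from incorrect answers.
flat_conf, flat_correct = [0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]

ranked = (expected_calibration_error(ranked_conf, ranked_correct),
          auroc(ranked_conf, ranked_correct))
flat = (expected_calibration_error(flat_conf, flat_correct),
        auroc(flat_conf, flat_correct))
```

Computing both numbers separately for each specialty or question type is one way to see whether a drop in reliability comes from miscalibrated scores, weaker ranking, or both.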
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
