An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification
L. Julián Lechuga López, Farah E. Shamout, Tim G. J. Rudner

TL;DR
This study empirically examines the reliability of uncertainty-based selective prediction in multimodal clinical condition classification, revealing calibration issues that impair safety guarantees.
Contribution
It identifies a task-specific failure mode of selective prediction due to class-dependent miscalibration in multimodal clinical AI models.
Findings
Selective prediction can substantially degrade performance even when standard evaluation metrics are strong.
Models exhibit severe class-dependent miscalibration, especially for underrepresented conditions.
Standard aggregate metrics may hide these calibration issues.
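To illustrate how an aggregate metric can mask class-dependent miscalibration, here is a minimal sketch using expected calibration error (ECE) on toy data. The function name, the bin count, and all numbers are assumptions made for illustration. They are not taken from the paper, which may use different calibration measures.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin-weighted mean of |positive rate - mean predicted probability|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps every probability in [0, 1] to a bin id in 0..n_bins-1.
    bin_ids = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(labels[mask].mean() - probs[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the fraction of samples in the bin
    return ece

# Hypothetical toy data (not from the paper).
# Common condition: predicted 0.75, observed positive rate 75/100.
probs_common = np.full(100, 0.75)
labels_common = np.concatenate([np.ones(75), np.zeros(25)])
# Rare condition: predicted 0.95, observed positive rate only 3/10.
probs_rare = np.full(10, 0.95)
labels_rare = np.concatenate([np.ones(3), np.zeros(7)])

ece_common = expected_calibration_error(probs_common, labels_common)
ece_rare = expected_calibration_error(probs_rare, labels_rare)
# Pooling all (sample, label) pairs mimics an aggregate metric.
ece_pooled = expected_calibration_error(
    np.concatenate([probs_common, probs_rare]),
    np.concatenate([labels_common, labels_rare]),
)
print(f"common={ece_common:.3f} rare={ece_rare:.3f} pooled={ece_pooled:.3f}")
```

In this toy setup the rare condition is badly miscalibrated, yet the pooled value stays small because the rare condition contributes only 10 of the 110 pooled predictions.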
Abstract
As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used…
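The failure mode described in the abstract can be sketched on toy data. The example below assumes a simple uncertainty score for a binary label, `1 - max(p, 1 - p)`, and defers the most uncertain predictions. The function name, the score, and the numbers are illustrative assumptions, not the paper's method or results.

```python
import numpy as np

def selective_accuracy(probs, labels, coverage):
    """Accuracy on the `coverage` fraction of predictions with the lowest uncertainty."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    # Assumed uncertainty score: 0 means fully confident, 0.5 means maximally unsure.
    uncertainty = 1.0 - np.maximum(probs, 1.0 - probs)
    n_keep = max(1, int(round(coverage * len(probs))))
    # Keep the least uncertain predictions; the rest would be deferred to a human expert.
    keep = np.argsort(uncertainty, kind="stable")[:n_keep]
    return float((preds[keep] == labels[keep]).mean())

# Hypothetical model A: errors occur where the model is unsure.
probs_a = np.array([0.95, 0.90, 0.10, 0.05, 0.60, 0.45])
labels_a = np.array([1, 1, 0, 0, 0, 1])

# Hypothetical model B: confident on its errors, unsure on its correct predictions.
probs_b = np.array([0.95, 0.90, 0.55, 0.60, 0.45, 0.40])
labels_b = np.array([0, 0, 1, 1, 0, 0])

for name, p, y in [("A", probs_a, labels_a), ("B", probs_b, labels_b)]:
    full = selective_accuracy(p, y, coverage=1.0)
    half = selective_accuracy(p, y, coverage=0.5)
    print(f"model {name}: full coverage={full:.3f}, 50% coverage={half:.3f}")
```

Both toy models have the same accuracy at full coverage (4 of 6 correct). Deferring the most uncertain half raises the retained accuracy of model A to 1.0 but lowers that of model B to 1/3, because model B's uncertainty ranks its errors as its most confident predictions.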
