Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift
Behraj Khan, Tahir Qasim Syed, Nouman M. Durrani, Bilal Naseem, Shabir Ahmad, Rizwan Qureshi

TL;DR
This paper introduces StaRFM, a method that enhances foundation models' robustness and calibration under distribution shifts in vision-language and medical imaging tasks, improving accuracy and uncertainty estimation.
Contribution
StaRFM combines a Fisher information penalty (FIP) with a confidence misalignment penalty (CMP) to address distribution shift and miscalibration across diverse vision and medical datasets.
Findings
+3.5% accuracy on vision datasets
28% lower calibration error (ECE)
+4.2% DSC on medical benchmarks
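For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how ECE and DSC are commonly computed. It assumes the standard equal-width binned definition of expected calibration error and the standard Dice similarity coefficient for binary masks; the function names, the bin count, and the smoothing constant are illustrative choices, not details taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    n_bins=10 is an assumed default, not a value reported by the paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bins are (lo, hi]; the first bin also takes confidence exactly 0.
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            in_bin |= confidences == 0.0
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the fraction of samples in the bin
    return float(ece)

def dice_coefficient(pred_mask, true_mask, eps=1e-8):
    """Dice similarity coefficient (DSC) between two binary masks of any shape.

    eps is an assumed smoothing term that avoids division by zero on empty masks.
    """
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    return float(2.0 * intersection / (pred.sum() + true.sum() + eps))

if __name__ == "__main__":
    # Toy inputs: four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
    # Toy masks: predicted foreground overlaps the ground truth in one pixel.
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```

Lower ECE means predicted confidence tracks observed accuracy more closely, while higher DSC means greater overlap between predicted and reference segmentations.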
Abstract
Foundation models like CLIP and SAM have advanced computer vision and medical imaging via low-shot transfer learning, aiding CADD with limited data. However, their deployment faces two key challenges: distribution shift, where pre-training and post-training data distributions differ (e.g., due to inter-center image acquisition), and confidence misalignment, which leads to overconfident errors. These issues surface differently: vision-language models (e.g., CLIP) suffer from 2D embedding shift (image-text misalignment), while medical models (e.g., SAM) encounter 3D domain shifts (e.g., scanner variation) and voxel-wise calibration needs. Existing solutions are domain-specific. We propose StaRFM, a fusion of a Fisher information penalty (FIP) and a confidence misalignment penalty (CMP) that tackles both challenges. It applies FIP, extended to 3D via patch-wise…
Taxonomy
Methods: Segment Anything Model · Contrastive Language-Image Pre-training
