Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift

Behraj Khan; Tahir Qasim Syed; Nouman M. Durrani; Bilal Naseem; Shabir Ahmad; Rizwan Qureshi

arXiv:2507.09222 · cs.CV · July 22, 2025

TL;DR

This paper introduces StaRFM, a method that enhances foundation models' robustness and calibration under distribution shifts in vision-language and medical imaging tasks, improving accuracy and uncertainty estimation.

Contribution

StaRFM combines a Fisher information penalty (FIP) with a confidence misalignment penalty (CMP) to address distribution shift and miscalibration across diverse vision and medical datasets.

Findings

01  +3.5% accuracy on vision datasets

02  28% lower calibration error (ECE)

03  +4.2% DSC on medical benchmarks
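
The findings above are stated in terms of expected calibration error (ECE) and the Dice similarity coefficient (DSC). As a reminder of what those two numbers measure, here is a minimal sketch of their standard textbook definitions in plain Python. The function names, the choice of 10 equal-width bins, and the convention for two empty masks are illustrative assumptions; this page does not describe how the authors computed either metric.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,  # assumption: 10 equal-width bins, a common default
) -> float:
    """Equal-width-bin ECE: sum over bins of (|B|/N) * |acc(B) - conf(B)|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # summed confidence per bin
    acc_sum = [0.0] * n_bins   # number of correct predictions per bin
    for c, ok in zip(confidences, correct):
        # Bins are [lo, hi); a confidence of exactly 1.0 goes in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        acc_sum[b] += 1.0 if ok else 0.0

    # (|B|/N) * |acc_sum/|B| - conf_sum/|B|| simplifies to |acc_sum - conf_sum| / N.
    return sum(abs(a - c) for a, c in zip(acc_sum, conf_sum)) / n


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * intersection / total
```

A model that predicts with 95% confidence but is right only 75% of the time has an ECE of about 0.2 under this definition, which is the kind of overconfidence a calibration penalty is meant to reduce. For 3D segmentation, the same Dice function applies once the predicted and reference volumes are flattened to voxel lists.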

Abstract

Foundation models like CLIP and SAM have advanced computer vision and medical imaging via low-shot transfer learning, aiding CADD with limited data. However, their deployment faces two key challenges: distribution shift, where pre-training and post-training data distributions differ (e.g., due to inter-center image acquisition), and confidence misalignment, which leads to overconfident errors. These issues surface differently: vision-language models (e.g., CLIP) suffer from 2D embedding shift (image-text misalignment), while medical models (e.g., SAM) encounter 3D domain shifts (e.g., scanner variation) and need voxel-wise calibration. Existing solutions are domain-specific. We propose StaRFM, a fusion of a Fisher information penalty (FIP) and a confidence misalignment penalty (CMP) that tackles both challenges. It applies FIP, extended to 3D via patch-wise…

Taxonomy

Methods: Segment Anything Model · Contrastive Language-Image Pre-training