Robust Calibration of Large Vision-Language Adapters
Balamurali Murugesan, Julio Silva-Rodriguez, Ismail Ben Ayed, and Jose Dolz

TL;DR
This paper identifies and addresses the miscalibration issue in CLIP-based model adaptation, especially for out-of-distribution samples, proposing a simple, model-agnostic logit scaling method that improves calibration without sacrificing accuracy.
Contribution
The paper identifies an increase in logit ranges as the cause of miscalibration in CLIP adaptation methods and introduces a simple logit scaling technique that can be applied during inference or adaptation.
Findings
Miscalibration worsens with distributional drift in CLIP adaptation methods.
Scaling each sample's logit range to that of its zero-shot prediction logits mitigates miscalibration.
Proposed methods improve calibration across various OOD benchmarks.
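This summary does not say which metric quantifies calibration. A common choice is the expected calibration error (ECE), which bins predictions by confidence and averages the gap between accuracy and confidence in each bin. The sketch below assumes that metric; the function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are illustrative choices, not details taken from the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Assumed metric: ECE with equal-width confidence bins.

    probs:  (N, C) array of predicted class probabilities.
    labels: (N,) array of integer ground-truth labels.
    """
    confidences = probs.max(axis=1)        # top-1 confidence per sample
    predictions = probs.argmax(axis=1)     # top-1 predicted class
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - mean confidence| in this bin, weighted by
            # the fraction of samples that fall into the bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

A lower value means confidence tracks accuracy more closely; an overconfident model (high confidence, lower accuracy) yields a larger value.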
Abstract
This paper addresses the critical issue of miscalibration in CLIP-based model adaptation, particularly in the challenging scenario of out-of-distribution (OOD) samples, which has been overlooked in the existing literature on CLIP adaptation. We empirically demonstrate that popular CLIP adaptation approaches, such as Adapters, Prompt Learning, and Test-Time Adaptation, substantially degrade the calibration capabilities of the zero-shot baseline in the presence of distributional drift. We identify the increase in logit ranges as the underlying cause of miscalibration of CLIP adaptation methods, contrasting with previous work on calibrating fully-supervised models. Motivated by these observations, we present a simple and model-agnostic solution to mitigate miscalibration, by scaling the logit range of each sample to its zero-shot prediction logits. We explore three different alternatives…
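The abstract is truncated before it describes the alternatives, so the exact formulas are not available here. As a minimal sketch of the general idea it states (scaling the logit range of each sample to its zero-shot prediction logits), one plausible reading is a per-sample min-max rescaling of the adapted logits onto the interval spanned by the zero-shot logits. The function names `scale_to_zero_shot_range` and `softmax`, and the `eps` guard, are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scale_to_zero_shot_range(adapted_logits, zero_shot_logits, eps=1e-12):
    """Assumed variant: per-sample min-max rescaling.

    Maps each row of `adapted_logits` (N, C) onto the [min, max]
    interval of the matching row of `zero_shot_logits` (N, C).
    The map is affine with a positive slope, so the predicted
    class (argmax) of the adapted model is unchanged.
    """
    zs_min = zero_shot_logits.min(axis=-1, keepdims=True)
    zs_max = zero_shot_logits.max(axis=-1, keepdims=True)
    ad_min = adapted_logits.min(axis=-1, keepdims=True)
    ad_max = adapted_logits.max(axis=-1, keepdims=True)

    # Normalize adapted logits to [0, 1]; eps guards a zero range.
    unit = (adapted_logits - ad_min) / (ad_max - ad_min + eps)
    # Stretch and shift onto the zero-shot logit interval.
    return unit * (zs_max - zs_min) + zs_min

# Toy usage: the adapted model has a much wider logit range
# than the zero-shot model for the same sample.
adapted = np.array([[10.0, 0.0, -10.0]])
zero_shot = np.array([[2.0, 1.0, 0.0]])
scaled = scale_to_zero_shot_range(adapted, zero_shot)
print(softmax(adapted).max(), softmax(scaled).max())
```

Because the rescaling only shrinks or stretches the range, accuracy is untouched while the top-1 confidence moves toward that of the zero-shot model, which is consistent with the claim that calibration improves without sacrificing accuracy.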
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Image Processing Techniques and Applications
Methods: Contrastive Language-Image Pre-training
