Towards Calibrating Prompt Tuning of Vision-Language Models

Ashshak Sharifdeen; Fahad Shamshad; Muhammad Akhtar Munir; Abhishek Basu; Mohamed Insaf Ismithdeen; Jeyapriyan Jeyamohan; Chathurika Sewwandi Silva; Karthik Nandakumar; Muhammad Haris Khan

arXiv:2602.19024 · cs.CV · February 24, 2026

TL;DR

This paper introduces a calibration framework for prompt tuning of vision-language models such as CLIP. It improves confidence reliability and uncertainty estimation while preserving the geometry of the pretrained embedding space.

Contribution

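The paper extends the standard cross-entropy loss with two regularizers: a mean-variance margin penalty that stabilizes inter-class logit margins, and a text moment-matching loss that aligns tuned text embeddings with their frozen CLIP counterparts. Together they improve calibration while maintaining generalization.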

Findings

01

Significantly reduces Expected Calibration Error (ECE) across multiple datasets.

02

Effective across 7 prompt-tuning methods and 11 diverse datasets.

03

Preserves the semantic structure of the embedding space for robust generalization.
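
For readers unfamiliar with the metric named in the first finding, the following is a minimal sketch of Expected Calibration Error using equal-width confidence bins. The function name, the default of 10 bins, and the floor-based bin assignment are assumptions made for illustration; this is not the paper's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence| per bin.

    confidences: predicted top-class probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    n_bins:      number of equal-width bins (10 is an assumed default)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is clamped into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

As a usage example, four predictions made at confidence 0.9 with three of them correct give an accuracy of 0.75 in that bin, so the gap between confidence and accuracy yields an ECE of about 0.15. A lower value means the reported confidence tracks the observed accuracy more closely.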

Abstract

Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for…
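
The objective described in the abstract can be sketched roughly as cross-entropy plus two weighted penalty terms. The sketch below is illustrative only and rests on several assumptions the abstract does not confirm: the margin is taken as the true-class logit minus the largest competing logit, the second moment is approximated by per-dimension variance, and all function names and weights (`var_weight`, `lam_margin`, `lam_moment`) are hypothetical.

```python
import math
from statistics import fmean, pvariance


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logit rows."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(labels)


def margin_penalty(logits, labels, var_weight=1.0):
    """Assumed form: reward a large mean margin, penalize margin variance."""
    margins = []
    for row, y in zip(logits, labels):
        best_other = max(v for k, v in enumerate(row) if k != y)
        margins.append(row[y] - best_other)
    return -fmean(margins) + var_weight * pvariance(margins)


def moment_matching(tuned, frozen):
    """Assumed form: squared gaps in per-dimension mean and variance
    between tuned and frozen text embeddings (one row per class)."""
    loss = 0.0
    for d in range(len(tuned[0])):
        t_col = [e[d] for e in tuned]
        f_col = [e[d] for e in frozen]
        loss += (fmean(t_col) - fmean(f_col)) ** 2
        loss += (pvariance(t_col) - pvariance(f_col)) ** 2
    return loss


def total_loss(logits, labels, tuned, frozen, lam_margin=0.1, lam_moment=0.1):
    """Cross-entropy plus the two regularizers; the weights are placeholders."""
    return (cross_entropy(logits, labels)
            + lam_margin * margin_penalty(logits, labels)
            + lam_moment * moment_matching(tuned, frozen))
```

In this sketch the moment-matching term is zero when the tuned text embeddings have the same per-dimension mean and variance as the frozen ones, which is one simple way to express the idea of preserving the dispersion of the pretrained embedding space. The paper's actual formulation may differ, for example by matching full covariance rather than per-dimension variance.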

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling