ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models
Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, Zeynep Akata

TL;DR
ProbVLM is a post-hoc probabilistic adapter that estimates uncertainty in the embeddings of frozen, pre-trained vision-language models, improving retrieval, active learning, and model selection without requiring large-scale datasets or heavy compute.
Contribution
The paper presents ProbVLM, a post-hoc probabilistic adapter that captures embedding uncertainty in VLMs, improving their interpretability and downstream task performance.
Findings
ProbVLM outperforms existing methods in uncertainty estimation across four datasets (COCO, Flickr, CUB, and Oxford-flowers).
Uncertainty estimates improve retrieval accuracy and model selection.
Visualization of embedding distributions is enabled using a latent diffusion model.
Abstract
Large-scale vision-language models (VLMs) like CLIP successfully find correspondences between images and text. Through the standard deterministic mapping process, an image or a text sample is mapped to a single vector in the embedding space. This is problematic: as multiple samples (images or text) can abstract the same concept in the physical world, deterministic embeddings do not reflect the inherent ambiguity in the embedding space. We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained VLMs via inter/intra-modal alignment in a post-hoc manner without needing large-scale datasets or computing. On four challenging datasets, i.e., COCO, Flickr, CUB, and Oxford-flowers, we estimate the multi-modal embedding uncertainties for two VLMs, i.e., CLIP and BLIP, quantify the calibration of embedding uncertainties in retrieval…
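The idea of replacing a single embedding vector with a distribution can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the adapter is a pair of random linear heads, the output distribution is a diagonal Gaussian, and the intra-modal and inter-modal terms are read as "explain the sample's own frozen embedding" and "explain the paired embedding from the other modality". The helper names (make_linear, probabilistic_adapter, gaussian_nll) are hypothetical, and the embeddings are random stand-ins for frozen encoder outputs.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Hypothetical helper: a small random linear layer (weights, bias)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(layer, x):
    """Apply a (weights, bias) layer to the vector x."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softplus(v):
    """Numerically stable softplus, used to keep variances positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def probabilistic_adapter(frozen_embedding, mean_head, var_head):
    """Map a frozen point embedding to a diagonal Gaussian (mean, variance).

    The mean is a residual on top of the frozen embedding, so the adapter
    only has to learn a correction to the original embedding.
    """
    delta = linear(mean_head, frozen_embedding)
    mean = [e + d for e, d in zip(frozen_embedding, delta)]
    variance = [softplus(v) + 1e-6 for v in linear(var_head, frozen_embedding)]
    return mean, variance


def gaussian_nll(target, mean, variance):
    """Negative log-likelihood of target under a diagonal Gaussian."""
    return sum(
        0.5 * (math.log(2.0 * math.pi * var) + (t - m) ** 2 / var)
        for t, m, var in zip(target, mean, variance)
    )


rng = random.Random(0)
dim = 8
mean_head = make_linear(dim, dim, rng)
var_head = make_linear(dim, dim, rng)

# Random stand-ins for the outputs of a frozen image and text encoder.
image_embedding = [rng.gauss(0.0, 1.0) for _ in range(dim)]
text_embedding = [rng.gauss(0.0, 1.0) for _ in range(dim)]

mean, variance = probabilistic_adapter(image_embedding, mean_head, var_head)

# Assumed reading of intra-modal alignment: explain the image's own embedding.
intra_loss = gaussian_nll(image_embedding, mean, variance)
# Assumed reading of inter-modal alignment: explain the paired text embedding.
inter_loss = gaussian_nll(text_embedding, mean, variance)

# One scalar summary of how uncertain the adapter is about this sample.
uncertainty = sum(variance) / dim
```

In a trained adapter, the two loss terms would be minimized over the head parameters while the encoders stay frozen, and a scalar such as the mean predicted variance could then be used to rank samples by uncertainty.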
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: BLIP: Bootstrapping Language-Image Pre-training · Diffusion · Adapter · Contrastive Language-Image Pre-training
