Whitened CLIP as a Likelihood Surrogate of Images and Captions
Roy Betser, Meir Yossef Levi, Guy Gilboa

TL;DR
This paper introduces Whitened CLIP, a training-free linear transformation of CLIP embeddings that enables fast likelihood estimation for images and captions by approximating the distribution of the whitened embeddings as standard normal.
Contribution
The paper proposes a novel whitening transformation of CLIP embeddings that allows likelihood estimation without additional training, simplifying analysis of images and captions.
Findings
Whitened CLIP embeddings approximate a standard normal distribution.
Likelihood scores derived from whitened embeddings are effective for assessing images and captions.
The whitening process is fast and does not require training.
Abstract
Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean and unit standard deviation and is uncorrelated with every other feature, resulting in an identity covariance matrix. We show that the statistics of the whitened embeddings are well approximated by a standard normal distribution; the log-likelihood can therefore be estimated, up to an additive constant and a factor of -1/2, from the squared Euclidean norm in the whitened embedding space. The whitening procedure is completely training-free and uses a pre-computed whitening matrix, so it is very fast. We present several…
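The idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`fit_whitening`, `whiten`, `log_likelihood`) are hypothetical, synthetic correlated Gaussian data stands in for real CLIP embeddings, and PCA whitening is assumed as one of several valid ways to build a whitening matrix (the paper may use a different one).

```python
import numpy as np


def fit_whitening(embeddings, eps=1e-8):
    """Estimate the mean and a PCA whitening matrix W from an (n, d) array."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    cov = centered.T @ centered / (len(embeddings) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov = V diag(eigvals) V^T
    # W = diag(eigvals)^(-1/2) V^T; eps guards against tiny eigenvalues.
    W = (eigvecs / np.sqrt(eigvals + eps)).T
    return mean, W


def whiten(embeddings, mean, W):
    """Apply y = W (x - mean) to every row of an (n, d) array."""
    return (embeddings - mean) @ W.T


def log_likelihood(whitened):
    """Log-density of each row under a standard normal N(0, I)."""
    d = whitened.shape[-1]
    sq_norm = np.sum(whitened ** 2, axis=-1)
    return -0.5 * (sq_norm + d * np.log(2.0 * np.pi))


# Synthetic stand-in for CLIP embeddings (assumption: real embeddings
# would come from a CLIP image or text encoder instead).
rng = np.random.default_rng(0)
d = 8
mixing = np.eye(d) + 0.3 * np.ones((d, d))  # introduces correlations
X = rng.normal(size=(5000, d)) @ mixing.T + 3.0

mean, W = fit_whitening(X)
Z = whiten(X, mean, W)
scores = log_likelihood(Z)  # higher score = more likely under N(0, I)
```

In this sketch the whitening matrix is fitted once and then reused, so scoring a new embedding costs one matrix-vector product and one squared norm. Because the whitening is linear and invertible, ranking samples by `scores` is equivalent to ranking them by the negative squared norm of their whitened embedding.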
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
