Whitened CLIP as a Likelihood Surrogate of Images and Captions

Roy Betser; Meir Yossef Levi; Guy Gilboa

arXiv:2505.06934 · eess.IV · May 13, 2025


TL;DR

This paper introduces Whitened CLIP, a training-free transformation of CLIP embeddings that enables efficient likelihood estimation for images and captions by approximating their distribution as standard normal.

Contribution

The paper proposes a novel whitening transformation of CLIP embeddings that allows likelihood estimation without additional training, simplifying analysis of images and captions.

Findings

01. Whitened CLIP embeddings approximate a standard normal distribution.

02. Likelihood scores derived from whitened embeddings are effective for assessing images and captions.

03. The whitening process is fast and does not require training.
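
Findings 01 and 02 are linked by a standard identity, written out here as a short worked equation. The symbols $y$ (a whitened embedding) and $d$ (its dimension) are notation assumed for this sketch and do not appear on the page. If $y \in \mathbb{R}^d$ is modeled as standard normal, then:

```latex
\log p(y) \;=\; \log \mathcal{N}(y;\, 0,\, I_d)
         \;=\; -\frac{d}{2}\log(2\pi) \;-\; \frac{1}{2}\,\lVert y \rVert_2^2
```

Since the first term is constant for a fixed $d$, ranking samples by likelihood under this approximation amounts to ranking them by smaller squared norm.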

Abstract

Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with any other feature, resulting in an identity covariance matrix. We show that the statistics of the whitened embeddings can be well approximated by a standard normal distribution; thus, the log-likelihood is estimated simply by the squared Euclidean norm in the whitened embedding space. The whitening procedure is completely training-free and is performed using a pre-computed whitening matrix; hence, it is very fast. We present several…
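
The procedure the abstract describes can be illustrated with a minimal NumPy sketch. It assumes PCA whitening (eigendecomposition of the sample covariance), which is one common way to build an invertible whitening matrix; the paper's exact construction may differ. The function names `fit_whitening`, `whiten`, and `log_likelihood` are hypothetical, and random correlated vectors stand in for real CLIP embeddings.

```python
import numpy as np


def fit_whitening(embeddings, eps=1e-12):
    """Estimate the mean and a whitening matrix W from an (n, d) array.

    Assumption: PCA whitening, W = diag(lam^-1/2) V^T, where
    cov = V diag(lam) V^T. Other whitening matrices also exist.
    """
    mu = embeddings.mean(axis=0)
    centered = embeddings - mu
    cov = centered.T @ centered / (embeddings.shape[0] - 1)
    lam, V = np.linalg.eigh(cov)  # cov is symmetric, so eigh applies
    # Scale each eigenvector (column of V) by lam^-1/2, then transpose.
    W = (V / np.sqrt(lam + eps)).T
    return mu, W


def whiten(embeddings, mu, W):
    """Apply y = W (x - mu) to each row of an (n, d) array."""
    return (embeddings - mu) @ W.T


def log_likelihood(whitened):
    """Log-density under a standard normal model of whitened vectors."""
    d = whitened.shape[-1]
    sq_norm = np.sum(whitened ** 2, axis=-1)
    return -0.5 * (d * np.log(2.0 * np.pi) + sq_norm)


# Synthetic stand-in for a set of embeddings (not real CLIP outputs).
rng = np.random.default_rng(0)
mixing = rng.normal(size=(8, 8))
X = rng.normal(size=(5000, 8)) @ mixing.T + 3.0

mu, W = fit_whitening(X)          # done once, offline
scores = log_likelihood(whiten(X, mu, W))
print(scores.shape)
```

In this sketch the mean and matrix are computed once and reused, so scoring a new embedding costs one matrix-vector product and one norm, which matches the abstract's point that a pre-computed whitening matrix makes the method fast.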


Code & Models

Repositories

rbetser/W_CLIP
PyTorch · Official

Videos

Whitened CLIP as a Likelihood Surrogate of Images and Captions · SlidesLive

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training