Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers
Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek

TL;DR
The paper analyzes the latent space of contrastively pretrained vision-language models and identifies a shared noise subspace whose dimensions can be pruned with little or no loss in downstream performance, clarifying how these models structure their representations.
Contribution
It applies spectral decomposition of embedding covariance matrices to separate the multi-modal semantic signal from a shared noise subspace in VLMs, giving a mechanistic account of their representational structure.
Findings
The shared noise subspace exhibits strong subgroup invariance across distinct data subsets.
Pruning the shared noise dimensions is mostly harmless: it preserves or improves downstream task performance.
A significant part of the latent geometry is governed by architecture-level noise.
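One way to make the subgroup-invariance finding concrete is to compare the noise bases estimated on two data subsets. The sketch below is an illustration only: the function name `subspace_overlap` and the choice of metric (mean squared cosine of the principal angles) are assumptions, not the measure used in the paper.

```python
import numpy as np


def subspace_overlap(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
    """Overlap between two subspaces given as orthonormal column bases.

    Hypothetical metric: the singular values of basis_a.T @ basis_b are the
    cosines of the principal angles, and we return their mean square.
    For bases with the same number of columns, 1.0 means identical
    subspaces and 0.0 means orthogonal ones.
    """
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(cosines ** 2))


if __name__ == "__main__":
    # Synthetic stand-ins for noise bases estimated on two subgroups.
    rng = np.random.default_rng(0)
    basis_one, _ = np.linalg.qr(rng.normal(size=(32, 4)))
    basis_two, _ = np.linalg.qr(rng.normal(size=(32, 4)))
    print("same subspace:", subspace_overlap(basis_one, basis_one))
    print("two random subspaces:", subspace_overlap(basis_one, basis_two))
```

Under this assumed metric, a subgroup-invariant noise subspace would give values near 1 when the bases come from different data subsets, while unrelated subspaces in a high-dimensional space give values near the ratio of subspace dimension to embedding dimension.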
Abstract
Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared,…
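The decomposition described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `split_eigenspace` and `prune_noise` are hypothetical, and treating the lowest-variance eigen-directions as the noise subspace is a placeholder criterion, since the summary does not say how the noise dimensions are selected.

```python
import numpy as np


def split_eigenspace(embeddings: np.ndarray, n_noise: int):
    """Eigendecompose the embedding covariance and split the eigenbasis.

    ASSUMPTION: the n_noise lowest-variance directions stand in for the
    "shared noise subspace"; the paper's actual selection rule may differ.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (embeddings.shape[0] - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    noise_basis = eigvecs[:, :n_noise]
    signal_basis = eigvecs[:, n_noise:]
    return eigvals, signal_basis, noise_basis


def prune_noise(embeddings: np.ndarray, noise_basis: np.ndarray) -> np.ndarray:
    """Remove the component of each embedding that lies in the noise subspace."""
    return embeddings - (embeddings @ noise_basis) @ noise_basis.T


if __name__ == "__main__":
    # Synthetic embeddings: five high-variance and three low-variance axes.
    rng = np.random.default_rng(0)
    scales = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.1, 0.1, 0.1])
    fake_embeddings = rng.normal(size=(200, 8)) * scales
    eigvals, signal_basis, noise_basis = split_eigenspace(fake_embeddings, n_noise=3)
    pruned = prune_noise(fake_embeddings, noise_basis)
    print("eigenvalue spectrum:", np.round(eigvals, 3))
    print("pruned shape:", pruned.shape)
```

With real models, `embeddings` would hold image and text features stacked row-wise; pruning by projection keeps the original dimensionality, so the pruned vectors can be fed to the same downstream evaluation as the originals.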
