The Double-Ellipsoid Geometry of CLIP
Meir Yossef Levi, Guy Gilboa

TL;DR
This paper shows that raw, unnormalized CLIP embeddings lie on ellipsoid shells that are not centered at the origin, rather than on origin-centered spheres, and introduces a conformity measure for understanding and improving contrastive training by accounting for data uncertainty.
Contribution
It shows that CLIP text and image embeddings lie on ellipsoid shells and proposes a conformity measure, grounded in this geometry, to enhance contrastive learning.
Findings
Text and image embeddings lie on linearly separable ellipsoid shells.
Conformity can be accurately estimated by the cosine similarity to the modality mean vector.
CLIP's modality gap aligns conformity distributions of images and texts.
Abstract
Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains. We investigate the geometry of this embedding, which is still not well understood. We examine the raw unnormalized embedding and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of this structure, which allows instances to be embedded according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we…
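
The conformity definition in the abstract can be made concrete with a short sketch. The NumPy code below computes conformity as the average cosine similarity of each instance to every other instance, and compares it with the cheaper estimate given by the cosine similarity to the mean vector. It is an illustration only, not the authors' implementation: the function names `conformity` and `conformity_estimate` are hypothetical, the embeddings are synthetic off-center Gaussian data standing in for one modality of real CLIP embeddings, and taking the mean over the raw (unnormalized) vectors is an assumption.

```python
import numpy as np


def conformity(embeddings: np.ndarray) -> np.ndarray:
    """Average cosine similarity of each row to every other row."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T  # pairwise cosine similarities, shape (n, n)
    n = unit.shape[0]
    # Drop the self-similarity term (always 1) before averaging.
    return (sims.sum(axis=1) - 1.0) / (n - 1)


def conformity_estimate(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity of each row to the mean vector (assumed: raw mean)."""
    mean = embeddings.mean(axis=0)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ (mean / np.linalg.norm(mean))


# Synthetic stand-in for one modality: an anisotropic cloud shifted away
# from the origin. These are NOT real CLIP embeddings.
rng = np.random.default_rng(0)
dim, n = 64, 500
offset = rng.normal(size=dim) * 2.0        # hypothetical modality mean
scales = rng.uniform(0.5, 1.5, size=dim)   # per-axis spread (ellipsoid-like)
embeddings = offset + rng.normal(size=(n, dim)) * scales

exact = conformity(embeddings)
approx = conformity_estimate(embeddings)
correlation = np.corrcoef(exact, approx)[0, 1]
print(f"Pearson correlation between exact and estimated conformity: {correlation:.3f}")
```

The exact computation needs all pairwise similarities, which is quadratic in the number of instances, while the estimate needs a single pass to form the mean and one dot product per instance. How closely the two agree on real CLIP embeddings is the paper's claim; the synthetic data here only shows how the two quantities are computed.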
