IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov

TL;DR
This paper analyzes intra-modal misalignment in CLIP, traces it to the role of the projectors, and proposes a training-free method that improves intra-modal alignment by removing anisotropic directions, yielding better intra-modal retrieval and classification.
Contribution
The paper introduces a spectral analysis of CLIP's projectors that identifies anisotropic directions and removes them, enhancing intra-modal alignment without additional training.
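As a rough illustration of what a spectral analysis of a projection matrix can look like, the sketch below computes a singular value spectrum and zeroes out a few leading singular directions. Everything in it is an assumption made for illustration: the function names `singular_spectrum` and `drop_leading_directions` are hypothetical, the matrix `W` is a random stand-in rather than a real CLIP projector, and treating the top-k singular directions as the "anisotropic" ones is a guess, since the summary here does not say which directions the paper removes or how it selects them.

```python
import numpy as np


def singular_spectrum(W):
    """Return the singular values of W in descending order."""
    return np.linalg.svd(W, compute_uv=False)


def drop_leading_directions(W, k):
    """Return a copy of W with its k largest singular directions zeroed out.

    Assumption: 'anisotropic' is read here as 'dominant singular direction'.
    This is an illustrative choice, not the paper's stated criterion.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S_kept = S.copy()
    S_kept[:k] = 0.0
    # (U * S_kept) scales each column of U by the matching singular value.
    return (U * S_kept) @ Vt


rng = np.random.default_rng(0)

# Hypothetical stand-in for a projector: a random matrix in which three
# input directions are inflated to create a strongly anisotropic spectrum.
W = rng.standard_normal((64, 96))
W[:, :3] *= 20.0

W_flat = drop_leading_directions(W, k=3)

spectrum_before = singular_spectrum(W)
spectrum_after = singular_spectrum(W_flat)
print("largest singular value before:", spectrum_before[0])
print("largest singular value after: ", spectrum_after[0])
```

No retraining is involved in a step like this: it is a fixed linear-algebra operation on existing weights, which matches the training-free framing above, though the actual procedure in the paper may differ.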
Findings
Removing anisotropic directions improves intra-modal retrieval.
The method reduces intra-modal misalignment across multiple CLIP models.
The approach is training-free and lowers latency.
Abstract
Vision-Language Models like CLIP are extensively used for inter-modal tasks that involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal…
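To make the two operators mentioned in the abstract concrete, the cosine similarity of projected features can be written out as follows. This is a sketch under stated assumptions: the projectors are taken to be linear and bias-free, and the symbols $f$, $g$ (pre-projection image and text embeddings) and $W_I$, $W_T$ (image and text projectors) are notation introduced here, not necessarily the paper's own.

```latex
\[
\cos\!\left(W_I f,\; W_T g\right)
= \frac{f^{\top} W_I^{\top} W_T\, g}
       {\sqrt{f^{\top} W_I^{\top} W_I\, f}\;\sqrt{g^{\top} W_T^{\top} W_T\, g}}
\]
```

Under these assumptions, the cross-modal product $W_I^{\top} W_T$ appears only in the numerator, where it couples the two modalities, while the same-modality products $W_I^{\top} W_I$ and $W_T^{\top} W_T$ appear only in the normalizing denominator. This is consistent with the abstract's description of an inter-modal operator that aligns the modalities and an intra-modal operator that only enforces normalization.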
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
