CLIP Adaptation by Intra-modal Overlap Reduction
Alexey Kravets, Vinay Namboodiri

TL;DR
This paper analyzes the intra-modal overlap in CLIP's image embeddings and proposes a lightweight adapter to reduce this overlap, leading to improved few-shot classification accuracy, robustness, and feature discriminability.
Contribution
It introduces a novel intra-modal overlap reduction method via a lightweight adapter, enhancing CLIP's few-shot classification performance and robustness.
Findings
Reduced intra-modal overlap improves classification accuracy
Enhanced robustness to distribution shifts
Features become more discriminative for downstream tasks
Abstract
Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. As CLIP is trained on a large corpus, it generalises well through adaptation to few-shot classification. In this work, we analyse the intra-modal overlap in image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from the CLIP model exhibit a high overlap in image space between the cosine similarity distributions of paired and unpaired examples, which affects the performance of few-shot training-free classification methods that rely on similarity in the image space for their predictions. To tackle intra-modal overlap, we propose to train a lightweight adapter on a generic set of samples from the Google Open Images dataset, demonstrating that this improves accuracy for few-shot training-free classification. We validate our contribution…
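To make the notion of overlap between similarity distributions concrete, here is a minimal sketch of one way it could be measured. It is not the authors' procedure. It assumes that "paired" means two images sharing a class label and "unpaired" means two images from different classes, and it uses histogram intersection as the overlap score. The function names and the synthetic embeddings are hypothetical and stand in for real CLIP image features.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarities between the rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Clip to guard against tiny floating-point excursions outside [-1, 1].
    return np.clip(x @ x.T, -1.0, 1.0)


def intra_modal_overlap(embeddings, labels, bins=50):
    """Histogram intersection of paired vs. unpaired similarity distributions.

    Assumption: a pair is "paired" when both items share a label.
    Returns a value in [0, 1]; 0 means the two distributions are disjoint.
    """
    sims = cosine_similarity_matrix(embeddings)
    upper = np.triu_indices(len(labels), k=1)  # each pair once, no self-pairs
    pair_sims = sims[upper]
    same_label = (labels[:, None] == labels[None, :])[upper]

    edges = np.linspace(-1.0, 1.0, bins + 1)
    paired_hist, _ = np.histogram(pair_sims[same_label], bins=edges)
    unpaired_hist, _ = np.histogram(pair_sims[~same_label], bins=edges)

    paired_hist = paired_hist / paired_hist.sum()
    unpaired_hist = unpaired_hist / unpaired_hist.sum()
    return float(np.minimum(paired_hist, unpaired_hist).sum())


# Synthetic stand-in for image embeddings: 5 classes, 20 samples each.
rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 32))
labels = np.repeat(np.arange(5), 20)

# Tight clusters: same-class embeddings are almost identical.
tight = centers[labels] + 0.1 * rng.normal(size=(100, 32))
# Loose clusters: class structure is mostly drowned out by noise.
loose = centers[labels] + 3.0 * rng.normal(size=(100, 32))

overlap_tight = intra_modal_overlap(tight, labels)
overlap_loose = intra_modal_overlap(loose, labels)
print(f"overlap (tight clusters): {overlap_tight:.3f}")
print(f"overlap (loose clusters): {overlap_loose:.3f}")
```

Under these assumptions, a lower score means same-class and different-class similarities are easier to tell apart in image space, which is the property that similarity-based training-free classifiers depend on.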
Taxonomy
Topics: Subtitles and Audiovisual Media
Methods: Sparse Evolutionary Training · Adapter · Contrastive Language-Image Pre-training
