TL;DR
This paper introduces SSP, a training-free method that improves image-text alignment in CLIP by using local features to project data into subspaces, significantly enhancing few-shot classification performance.
Contribution
The paper proposes a novel, training-free subspace projection method that better aligns image and text features in CLIP for few-shot learning.
Findings
Outperforms state-of-the-art alignment methods on 11 datasets
Enhances the quality of image-text feature alignment
Seamlessly integrates into existing CLIP-based frameworks
Abstract
Vision-language models such as CLIP can map data from different modalities into a unified feature space, enabling zero/few-shot inference by measuring the similarity between given images and texts. However, most existing methods overlook the modality gap in CLIP's encoded features, which shows up as text and image features lying far apart from each other and results in limited classification performance. To tackle this issue, we introduce a method called Selective Vision-Language Subspace Projection (SSP), which incorporates local image features and uses them as a bridge to enhance the alignment between image-text pairs. Specifically, our SSP framework comprises two parallel modules: a vision projector and a language projector. Both projectors use local image features to span the respective subspaces for images and texts, thereby projecting the image and text features into…
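The core idea in the abstract, spanning a subspace with local image features and projecting image and text features into it before comparing them, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of a pseudoinverse-based orthogonal projector, and the random stand-in features are all assumptions made for illustration, and the "selective" step that picks which local features to use is not described in the visible abstract, so it is left out.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def subspace_projector(local_feats):
    """Orthogonal projector onto the span of the rows of local_feats (k, d).

    pinv(A) @ A is the orthogonal projector onto the row space of A.
    Assumption: the paper may build its projectors differently.
    """
    return np.linalg.pinv(local_feats) @ local_feats  # shape (d, d)

def project_and_score(image_feat, text_feats, vision_local, language_local):
    """Project features into their subspaces, then score by cosine similarity.

    image_feat:     (d,)   global image feature
    text_feats:     (c, d) one text feature per class
    vision_local:   (k, d) local features spanning the image subspace (hypothetical input)
    language_local: (k, d) local features spanning the text subspace (hypothetical input)
    """
    p_vision = subspace_projector(vision_local)
    p_language = subspace_projector(language_local)
    img = l2_normalize(image_feat @ p_vision)
    txt = l2_normalize(text_feats @ p_language)
    return txt @ img  # (c,) cosine similarity per class

# Random stand-ins for CLIP features; real features would come from the encoders.
rng = np.random.default_rng(0)
d, k, c = 16, 4, 3
image_feat = rng.normal(size=d)
text_feats = rng.normal(size=(c, d))
vision_local = rng.normal(size=(k, d))
language_local = rng.normal(size=(k, d))

P = subspace_projector(vision_local)
scores = project_and_score(image_feat, text_feats, vision_local, language_local)
predicted_class = int(np.argmax(scores))
```

Because the projectors are computed in closed form from the local features, a step like this needs no gradient updates, which is consistent with the training-free claim in the summary above.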
Taxonomy
Methods: Contrastive Language-Image Pre-training
