Selective Vision-Language Subspace Projection for Few-shot CLIP

Xingyu Zhu; Beier Zhu; Yi Tan; Shuo Wang; Yanbin Hao; Hanwang Zhang

arXiv:2407.16977·cs.CV·July 29, 2024

TL;DR

This paper introduces SSP, a training-free method that improves image-text alignment in CLIP by using local image features to span subspaces onto which image and text features are projected, improving few-shot classification performance.

Contribution

The paper proposes a novel, training-free subspace projection method that better aligns image and text features in CLIP for few-shot learning.

Findings

01 Outperforms state-of-the-art alignment methods on 11 datasets

02 Enhances the quality of image-text feature alignment

03 Seamlessly integrates into existing CLIP-based frameworks

Abstract

Vision-language models such as CLIP map data from different modalities into a unified feature space, enabling zero/few-shot inference by measuring the similarity between given images and texts. However, most existing methods overlook the modality gap in CLIP's encoded features: the text and image features lie far apart from each other, which limits classification performance. To tackle this issue, we introduce Selective Vision-Language Subspace Projection (SSP), which incorporates local image features and uses them as a bridge to enhance the alignment between image-text pairs. Specifically, our SSP framework comprises two parallel modules: a vision projector and a language projector. Both projectors use local image features to span the respective subspaces for images and texts, thereby projecting the image and text features into…
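The general idea of projecting features onto a subspace spanned by local features can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract shown here is truncated and does not specify how local features are selected or how the vision and language projectors differ, so the sketch assumes a single shared subspace built from the top singular directions of a local-feature matrix. The function names (`orthonormal_basis`, `project`, `l2_normalize`), the dimensions, and the random stand-in features are all hypothetical.

```python
import numpy as np


def orthonormal_basis(local_feats, k):
    """Return a (d, k) orthonormal basis for the top-k directions of local_feats (n, d)."""
    # Rows of vt are orthonormal directions in feature space, ordered by singular value.
    _, _, vt = np.linalg.svd(local_feats, full_matrices=False)
    return vt[:k].T


def project(feats, basis):
    """Orthogonally project each row of feats (m, d) onto span(basis)."""
    return feats @ basis @ basis.T


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Random stand-ins for CLIP outputs (assumption: real features would come from the encoders).
rng = np.random.default_rng(0)
d, n_local, k, n_classes = 16, 8, 4, 3
local_feats = rng.normal(size=(n_local, d))   # local (patch-level) image features
image_feat = rng.normal(size=(1, d))          # global image feature
text_feats = rng.normal(size=(n_classes, d))  # one text feature per class prompt

basis = orthonormal_basis(local_feats, k)
image_proj = l2_normalize(project(image_feat, basis))
text_proj = l2_normalize(project(text_feats, basis))

# Cosine similarity between the projected image and each projected class text.
scores = image_proj @ text_proj.T
pred = int(scores.argmax())
```

Because both modalities are projected onto the same subspace in this sketch, any component of the features orthogonal to the local features is discarded before similarity is computed; the paper's two separate projectors may behave differently.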

Code & Models

Repositories

zhuhsingyuu/ssp
PyTorch · Official

Taxonomy

Methods: Contrastive Language-Image Pre-training