Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP
Yayuan Li, Jintao Guo, Lei Qi, Wenbin Li, Yinghuan Shi

TL;DR
This paper introduces TIMO, a framework for training-free few-shot classification with CLIP in which image and text representations guide each other. It significantly improves performance and surpasses some training-required methods.
Contribution
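The paper proposes TIMO, a mutual guidance mechanism that addresses two issues in training-free CLIP-based few-shot learning: anomalous matches in the image modality and the varying quality of generated text prompts. It reports state-of-the-art results among training-free methods.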
Findings
TIMO outperforms existing training-free methods.
The TIMO-S variant surpasses training-required methods by 0.33%.
TIMO-S does so at approximately 1/100 of the time cost of training-required methods.
Abstract
Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without the requirement of additional training) mainly learn different modalities independently, leading to two essential issues: 1) severe anomalous match in image modality; 2) varying quality of generated text prompts. To address these issues, we build a mutual guidance mechanism that introduces an Image-Guided-Text (IGT) component to rectify the varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component to mitigate the anomalous match of image modality through text representations. By integrating IGT and TGI, we adopt a perspective of Text-Image Mutual guidance Optimization, proposing TIMO. Extensive experiments show…
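For readers new to the setting, the sketch below shows what "training-free" few-shot classification with two independent modalities can look like. It is not TIMO: it is a generic cache-style baseline in the spirit of methods such as Tip-Adapter, and it contains neither IGT nor TGI. The function names, the hyperparameters `alpha` and `beta`, and the toy features are all assumptions made for illustration; real use would substitute CLIP image and text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def training_free_logits(query, text_protos, support_feats, support_labels,
                         num_classes, alpha=1.0, beta=5.5):
    """Combine a text branch and an image branch without any training.

    query:          (n_query, d) image features of test samples
    text_protos:    (num_classes, d) one text embedding per class
    support_feats:  (n_support, d) image features of the few labeled shots
    support_labels: (n_support,) integer class ids of the shots
    alpha, beta:    hypothetical mixing and sharpness hyperparameters
    """
    query = l2_normalize(query)
    text_protos = l2_normalize(text_protos)
    support_feats = l2_normalize(support_feats)

    # Text branch: cosine similarity between each query and each class prompt.
    text_logits = 100.0 * query @ text_protos.T

    # Image branch: similarity to every support sample, sharpened by beta,
    # then summed per class through the one-hot support labels.
    affinity = query @ support_feats.T
    one_hot = np.eye(num_classes)[support_labels]
    image_logits = np.exp(-beta * (1.0 - affinity)) @ one_hot

    # The two modalities are simply added; neither corrects the other.
    return text_logits + alpha * image_logits


# Toy data standing in for CLIP features: 3 classes in a 4-dimensional space.
centers = np.eye(4)[:3]
text_protos = centers.copy()          # one "prompt embedding" per class
support_feats = centers.copy()        # one labeled shot per class
support_labels = np.array([0, 1, 2])
query = centers + 0.1                 # slightly perturbed test features

logits = training_free_logits(query, text_protos, support_feats,
                              support_labels, num_classes=3)
preds = logits.argmax(axis=1)
print(preds)
```

In this baseline the text and image branches are computed independently and only meet at the final sum, which is the kind of independent treatment of modalities that the abstract identifies as the source of the two issues.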
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis
Methods: ADaptive gradient method with the OPTimal convergence rate · Contrastive Language-Image Pre-training
