Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

Yayuan Li; Jintao Guo; Lei Qi; Wenbin Li; Yinghuan Shi

arXiv:2412.11375·cs.CV·December 17, 2024

TL;DR

This paper introduces TIMO, a framework for training-free few-shot classification with CLIP in which the image and text modalities guide each other, significantly improving performance and surpassing some training-required methods.

Contribution

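The paper proposes TIMO, a mutual guidance mechanism that addresses two issues in training-free CLIP-based few-shot learning: anomalous matches in the image modality and the varying quality of generated text prompts. It achieves state-of-the-art results among training-free methods.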

Findings

01 TIMO outperforms existing training-free methods.

02 TIMO-S surpasses training-required methods by 0.33%.

03 The approach reduces time cost by a factor of approximately 100 compared with training-required methods.

Abstract

Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without the requirement of additional training) mainly learn different modalities independently, leading to two essential issues: 1) severe anomalous matches in the image modality; 2) varying quality of generated text prompts. To address these issues, we build a mutual guidance mechanism that introduces an Image-Guided-Text (IGT) component to rectify the varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component to mitigate the anomalous matches in the image modality through text representations. By integrating IGT and TGI, we adopt a perspective of Text-Image Mutual guidance Optimization, proposing TIMO. Extensive experiments show…

Code & Models

Repositories

lyymuwu/timo · PyTorch · Official

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis

Methods: ADaptive gradient method with the OPTimal convergence rate · Contrastive Language-Image Pre-training