Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners

Mushui Liu; Bozheng Li; Yunlong Yu

arXiv:2407.04003·cs.CV·July 8, 2024

Open Access

TL;DR

This paper introduces CLIP-CITE, a framework for fine-tuning large vision-language models efficiently with minimal overfitting, improving few-shot and cross-domain generalization while maintaining versatility.

Contribution

The paper presents a fine-tuning method that refines the entire VLM while keeping parameter adjustments minimal, addressing overfitting and catastrophic forgetting under limited supervision.

Findings

01. Enhanced performance in few-shot learning tasks.

02. Improved generalization across domains.

03. Preserved versatility of VLMs on various datasets.

Abstract

Prompt tuning, which trains a small set of parameters, effectively adapts pre-trained Vision-Language Models (VLMs) to downstream tasks. However, this often comes at the cost of flexibility and adaptability when the tuned models are applied to different datasets or domains. In this paper, we explore capturing task-specific information through careful refinement of the entire VLM, with minimal parameter adjustments. When the entire VLM is fine-tuned for a specific task under limited supervision, overfitting and catastrophic forgetting become the de facto obstacles. To mitigate these issues, we propose a framework named CLIP-CITE that designs a discriminative visual-text task, further aligns the visual-text semantics in a supervised manner, and integrates knowledge distillation techniques to preserve the gained knowledge. Extensive experimental results under few-shot…
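The abstract names two ingredients that can be illustrated together: a discriminative visual-text classification objective and a knowledge-distillation term that keeps the fine-tuned model close to the pre-trained one. The following is a minimal sketch under stated assumptions, not the paper's actual formulation. The function names, the `temperature` value, the `distill_weight` parameter, and the choice of a KL-divergence distillation term over class logits are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_logits(image_emb, class_text_embs, temperature=0.07):
    """Similarity of one image embedding to each class text embedding.

    The temperature value is an assumption, not taken from the source.
    """
    return [cosine(image_emb, t) / temperature for t in class_text_embs]


def cross_entropy(logits, label):
    """Discriminative term: negative log-probability of the true class."""
    return -log_softmax(logits)[label]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over the class distributions."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def combined_loss(student_logits, teacher_logits, label, distill_weight=1.0):
    """Classification loss plus a weighted distillation penalty.

    teacher_logits stand in for the frozen pre-trained model's outputs;
    distill_weight is a hypothetical balancing coefficient.
    """
    ce = cross_entropy(student_logits, label)
    kd = kl_divergence(teacher_logits, student_logits)
    return ce + distill_weight * kd
```

In this sketch the distillation term is zero when the fine-tuned model's class distribution matches the frozen model's, so the penalty only grows as fine-tuning drifts away from the pre-trained behavior. That is one plausible way to discourage catastrophic forgetting; the paper's own losses may differ.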

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training · Knowledge Distillation