Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners
Mushui Liu, Bozheng Li, Yunlong Yu

TL;DR
This paper introduces CLIP-CITE, a framework for fine-tuning large vision-language models efficiently with minimal overfitting, improving few-shot and cross-domain generalization while maintaining versatility.
Contribution
The paper presents a novel fine-tuning method that refines entire VLMs with minimal parameter adjustments, addressing overfitting and catastrophic forgetting under limited supervision.
Findings
Enhanced performance in few-shot learning tasks.
Improved generalization across domains.
Preserved versatility of VLMs on various datasets.
Abstract
Prompt tuning, which involves training a small set of parameters, effectively adapts pre-trained Vision-Language Models (VLMs) to downstream tasks. However, it often comes at the cost of flexibility and adaptability when the tuned models are applied to different datasets or domains. In this paper, we explore capturing task-specific information via meticulous refinement of entire VLMs, with minimal parameter adjustments. When fine-tuning entire VLMs for specific tasks under limited supervision, overfitting and catastrophic forgetting become the de facto challenges. To mitigate these issues, we propose a framework named CLIP-CITE, which designs a discriminative visual-text task, further aligns the visual-text semantics in a supervised manner, and integrates knowledge distillation techniques to preserve the gained knowledge. Extensive experimental results under few-shot…
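The abstract names two ingredients that are easy to illustrate: a discriminative visual-text classification objective and knowledge distillation from the pre-trained model. The following is a minimal sketch under stated assumptions, not the CLIP-CITE implementation. The function names, the temperature of 0.07, the KL-based distillation term, and the `distill_weight` parameter are all hypothetical choices made for illustration; the paper's actual losses may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_logits(image_emb, text_embs, temperature=0.07):
    """Score one image embedding against one text embedding per class.

    The temperature value is an assumption for this sketch.
    """
    return [cosine(image_emb, t) / temperature for t in text_embs]


def cross_entropy(logits, label):
    """Discriminative term: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def few_shot_loss(student_logits, teacher_logits, label, distill_weight=1.0):
    """Classification loss plus a distillation penalty (assumed form).

    student_logits come from the model being fine-tuned; teacher_logits
    come from a frozen copy of the pre-trained model. The KL term
    discourages the fine-tuned model from drifting away from the
    pre-trained predictions, which is one generic way to limit forgetting.
    """
    ce = cross_entropy(student_logits, label)
    kd = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return ce + distill_weight * kd
```

In this sketch, when the student and teacher produce identical logits the distillation term vanishes and only the classification loss remains; the penalty grows as the fine-tuned predictions move away from the pre-trained ones.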
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training · Knowledge Distillation
