Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning
Yaze Zhao, Yicong Liu, Yixiong Zou, Yuhua Li, Ruixuan Li

TL;DR
This paper revisits in-domain fine-tuning methods for source-free cross-domain few-shot learning with CLIP, proposing Semantic Probe to improve modality alignment and achieve state-of-the-art results.
Contribution
It analyzes why adapter-based methods outperform prompt-based ones in CDFSL and introduces Semantic Probe to enhance fine-tuning effectiveness.
Findings
LoRA improves modality alignment by rectifying collapsed attention.
The textual EOS token exhibits much better attention to visual samples.
Semantic Probe enhances performance on four CDFSL benchmarks.
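For readers unfamiliar with the adapter family named in the findings, here is a minimal NumPy sketch of a LoRA-style linear layer. It only illustrates the general low-rank update idea (a frozen weight plus a scaled product of two small trainable matrices); the class name `LoRALinear`, the rank, the `alpha` scaling, and the initialization are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a low-rank update (illustrative sketch only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen pretrained weight, shape (out_dim, in_dim)
        # Down-projection: small random init (assumed, hypothetical values).
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        # Up-projection: zero init, so the layer starts identical to the frozen one.
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        frozen = x @ self.weight.T
        update = self.scale * (x @ self.A.T) @ self.B.T
        return frozen + update
```

In practice only `A` and `B` would be trained, and the update can be merged into the frozen weight as `weight + scale * B @ A` after fine-tuning.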
Abstract
Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find that adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover that LoRA's superiority stems from rectifying the collapsed attention of the visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find that the textual EOS token exhibits much better attention to visual samples, and that CLIP's standard contrastive loss weakly constrains modality alignment. Based on these insights, we…
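As background for the contrastive loss the abstract refers to, the following is a minimal NumPy sketch of the symmetric image-text contrastive objective commonly associated with CLIP pretraining. It is an assumption that this symmetric batch form is the one meant; few-shot fine-tuning setups often use an image-to-class-text variant instead. The function names and the temperature value of 0.07 are illustrative choices, not details from the paper.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text features.

    Row i of image_feats is assumed to be paired with row i of text_feats.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss only rewards each matched pair for scoring higher than the mismatched pairs in the batch.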
