Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners
N.K.B.M.P.K.B. Narasinghe, Uthayasanker Thayasivam

TL;DR
This paper systematically studies how to adapt Contrastive Captioner (CoCa) models for few-shot image classification, reporting findings on data augmentation, hybrid objectives, and training strategies that improve performance when labeled data is limited.
Contribution
It provides a comprehensive empirical analysis of fine-tuning CoCa models for few-shot learning, exploring strategies like hybrid objectives and LoRA, and offers practical guidelines for adaptation.
Findings
Strong data augmentation degrades linear probe performance in low-shot settings.
Hybrid objectives with Supervised Contrastive loss improve accuracy over standard Cross-Entropy.
The study provides empirical guidelines for regularization, rank, and sampling strategies in low-data regimes.
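The hybrid objective named in the findings can be illustrated with a minimal sketch that adds a Supervised Contrastive (SupCon) term to standard Cross-Entropy. This is not the paper's implementation: the additive form, the `weight` and `temperature` values, and all function names below are assumptions chosen for illustration, and the SupCon term follows the commonly used formulation in which each anchor is pulled toward same-class samples in the batch.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy(logits, labels):
    # Mean softmax cross-entropy, using the log-sum-exp trick for stability.
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[y]
    return total / len(labels)

def supcon_loss(embeddings, labels, temperature=0.1):
    # Supervised contrastive loss averaged over anchors that have
    # at least one same-class positive in the batch.
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # no positive pair for this anchor
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

def hybrid_loss(logits, embeddings, labels, weight=0.5, temperature=0.1):
    # Assumed combination: CE plus a weighted SupCon term (weight is hypothetical).
    ce = cross_entropy(logits, labels)
    return ce + weight * supcon_loss(embeddings, labels, temperature)
```

In a few-shot batch, class-balanced sampling matters for the SupCon term here: an anchor whose class appears only once in the batch contributes nothing to it.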
Abstract
Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an…
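For readers unfamiliar with the Low-Rank Adaptation (LoRA) mentioned in the abstract, the following is a minimal sketch of the general idea: a frozen weight matrix W is augmented with a trainable low-rank update, giving an effective weight W + (alpha / r) * B A. The class name `LoRALinear`, the default `rank` and `alpha`, and the initialization scale are assumptions for illustration only; the abstract does not state which CoCa layers are adapted or which hyperparameters are used.

```python
import random

def matmul(a, b):
    # Plain-Python matrix product of a (m x k) and b (k x n).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

class LoRALinear:
    """Linear layer with a frozen weight and a trainable low-rank update."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen base weight, shape (out_dim, in_dim)
        out_dim, in_dim = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is
        # exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        # W + (alpha / rank) * B @ A
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, inputs):
        # inputs: list of input vectors, each of length in_dim.
        w = self.effective_weight()
        return [[sum(wi * xi for wi, xi in zip(row, vec)) for row in w] for vec in inputs]

    def num_trainable(self):
        # Only A and B are trained: rank * (in_dim + out_dim) parameters.
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)
```

The trainable parameter count grows with `rank * (in_dim + out_dim)` rather than `in_dim * out_dim`, which is why the choice of rank is a natural regularization knob when only a few labeled examples are available.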
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
