Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners

N.K.B.M.P.K.B. Narasinghe; Uthayasanker Thayasivam

arXiv:2512.12824 · cs.CV · December 16, 2025

TL;DR

This paper systematically studies how to adapt Contrastive Captioner (CoCa) models for few-shot image classification, reporting findings on data augmentation, hybrid objectives, and training strategies that improve performance with limited data.

Contribution

The paper provides a comprehensive empirical analysis of fine-tuning CoCa models for few-shot learning, exploring strategies such as hybrid objectives and Low-Rank Adaptation (LoRA), and offers practical guidelines for adaptation.

Findings

01 Strong data augmentation degrades linear probe performance in low-shot settings.

02 Hybrid objectives with Supervised Contrastive loss improve accuracy over standard Cross-Entropy.

03 The study offers empirical guidelines for regularization, rank, and sampling strategies in low-data regimes.
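
To make finding 02 concrete, the following is a minimal sketch of a hybrid objective that adds a supervised contrastive (SupCon) term to cross-entropy. This page does not give the paper's exact formulation, so the weighting `lam`, the `temperature`, and all function names are illustrative assumptions rather than the authors' implementation. The SupCon term follows the commonly used form in which each anchor is pulled toward same-class samples in the batch and pushed away from all others.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logit rows."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[y]
    return total / len(labels)


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def hybrid_loss(logits, embeddings, labels, lam=0.5, temperature=0.1):
    """Cross-entropy plus a weighted SupCon term (lam is a hypothetical weight)."""
    ce = cross_entropy(logits, labels)
    return ce + lam * supcon_loss(embeddings, labels, temperature)
```

Setting `lam=0` recovers plain cross-entropy, which gives a simple baseline for comparing the two objectives on the same batches.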

Abstract

Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an…
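
For readers unfamiliar with the LoRA technique named in the abstract, here is a minimal sketch of the standard formulation: a frozen weight matrix `W` is augmented with a trainable low-rank update `B @ A` scaled by `alpha / rank`, with `B` initialised to zero so the adapted layer starts out identical to the frozen one. The class name `LoRALinear`, the default `rank` and `alpha` values, and the initialisation scale are assumptions chosen for illustration; the abstract does not state which CoCa layers are adapted or which hyperparameters the authors use.

```python
import random


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


class LoRALinear:
    """Frozen linear map plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen pretrained weight, shape (out, in)
        out_dim, in_dim = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A gets a small random init and B starts at zero, so the update
        # B @ A is exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        """Count parameters in A and B only; the frozen weight is excluded."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

The trainable parameter count grows as `rank * (in_dim + out_dim)` instead of `in_dim * out_dim`, which is why the rank is a natural knob to tune when only a few labelled examples per class are available.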

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning