Multimodal CLIP Inference for Meta-Few-Shot Image Classification
Constance Ferragu, Philomene Chagniot, Vincent Coyette

TL;DR
This paper shows that multimodal foundation models like CLIP can directly excel at meta-few-shot image classification benchmarks without additional training, outperforming existing meta-learning methods.
Contribution
The paper demonstrates that combining CLIP's text and image modalities improves few-shot classification performance without extra training, and it positions this approach as a new baseline.
Findings
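Combining CLIP's text and image encoders outperforms state-of-the-art meta-few-shot learners on widely adopted benchmarks.
Multimodal training improves model robustness, particularly to ambiguities, a limitation frequently observed in the few-shot setup.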
No additional training is needed for CLIP to excel in this setting.
Abstract
In recent literature, few-shot classification has predominantly been defined by the N-way k-shot meta-learning problem. Models designed for this purpose are usually trained to excel on standard benchmarks following a restricted setup, excluding the use of external data. Given the recent advancements in large language and vision models, a question naturally arises: can these models directly perform well on meta-few-shot learning benchmarks? Multimodal foundation models like CLIP, which learn a joint (image, text) embedding, are of particular interest. Indeed, multimodal training has proven to enhance model robustness, especially regarding ambiguities, a limitation frequently observed in the few-shot setup. This study demonstrates that combining modalities from CLIP's text and image encoders outperforms state-of-the-art meta-few-shot learners on widely adopted benchmarks, all without…
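To make the N-way k-shot setup concrete, here is a minimal sketch of one simple way to fuse text and image embeddings into class prototypes. The abstract above is truncated and does not state the fusion rule, so the weighted average, the `alpha` parameter, the function names, and the toy 3-dimensional vectors standing in for CLIP embeddings are all assumptions made for illustration, not the authors' method.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_prototypes(support_embs, text_embs, alpha=0.5):
    """Build one prototype per class from k support images and one text prompt.

    support_embs: {class_name: [image_embedding, ...]}  (k shots per class)
    text_embs:    {class_name: text_embedding}
    alpha:        hypothetical weight on the text embedding
                  (1.0 = text only, 0.0 = images only).
    """
    prototypes = {}
    for name, shots in support_embs.items():
        image_proto = normalize(mean([normalize(s) for s in shots]))
        text_proto = normalize(text_embs[name])
        fused = [alpha * t + (1 - alpha) * i
                 for t, i in zip(text_proto, image_proto)]
        prototypes[name] = normalize(fused)
    return prototypes


def classify(query_emb, prototypes):
    """Return the class whose prototype has the highest cosine similarity."""
    q = normalize(query_emb)
    return max(prototypes, key=lambda name: dot(q, prototypes[name]))


# Toy 2-way 2-shot episode. In practice these vectors would come from
# CLIP's image and text encoders; here they are hand-written stand-ins.
support = {
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "dog": [[0.1, 1.0, 0.0], [0.2, 0.9, 0.1]],
}
text = {
    "cat": [1.0, 0.0, 0.2],
    "dog": [0.0, 1.0, 0.2],
}

prototypes = build_prototypes(support, text, alpha=0.5)
prediction = classify([0.8, 0.3, 0.1], prototypes)
print(prediction)
```

Nothing here is trained: the prototypes are computed once per episode from frozen embeddings, which matches the "no additional training" setting the summary describes. Setting `alpha` to 1.0 or 0.0 gives a text-only or image-only baseline for comparison.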
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Medical Imaging Techniques and Applications · COVID-19 diagnosis using AI
