ECO: Ensembling Context Optimization for Vision-Language Models
Lorenzo Agnolucci, Alberto Baldrati, Francesco Todino, Federico Becattini, Marco Bertini, Alberto Del Bimbo

TL;DR
This paper introduces ECO, a method that learns an ensemble of diverse textual prompts to enhance vision-language models' few-shot image classification performance without extra inference costs.
Contribution
It proposes a novel prompt ensembling technique that improves few-shot classification by learning multiple diverse contexts instead of a single trainable prompt.
Findings
Improved few-shot classification accuracy across 11 benchmarks.
Ensembling diverse prompts outperforms single prompt approaches.
No additional inference cost is incurred by the ensemble method.
Abstract
Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves the results considerably and consistently compared with relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
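To illustrate how a prompt ensemble can add no inference cost, here is a minimal sketch that assumes the ensemble is combined by averaging per-class text features into a single classifier matrix before any image is seen. The abstract does not specify the combination rule, so the averaging step, the function names, the `logit_scale` value, and the random stand-in features (used in place of real CLIP encoder outputs) are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_ensemble_classifier(prompt_text_features):
    """Collapse M prompts into one classifier matrix (hypothetical helper).

    prompt_text_features: array of shape (M, C, D), i.e. M prompts,
    C classes, D embedding dimensions.
    Returns an array of shape (C, D), the same size a single prompt gives.
    """
    feats = l2_normalize(prompt_text_features)  # normalize each prompt's features
    mean = feats.mean(axis=0)                   # average over the M prompts
    return l2_normalize(mean)                   # re-normalize the averaged features


def classify(image_features, class_weights, logit_scale=100.0):
    """Cosine-similarity classification; cost is independent of M."""
    img = l2_normalize(image_features)
    logits = logit_scale * img @ class_weights.T  # (N, C)
    return logits.argmax(axis=-1)


# Random stand-ins for text and image embeddings (assumption: real
# features would come from a vision-language model's encoders).
rng = np.random.default_rng(0)
M, C, D = 4, 3, 8
text_feats = rng.normal(size=(M, C, D))
weights = build_ensemble_classifier(text_feats)  # computed once, offline
images = rng.normal(size=(5, D))
preds = classify(images, weights)
```

Under this assumption, the classifier matrix has shape (C, D) regardless of how many prompts are in the ensemble, so classifying an image takes the same single matrix product as with one prompt.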
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Focus · Contrastive Language-Image Pre-training
