ECO: Ensembling Context Optimization for Vision-Language Models

Lorenzo Agnolucci; Alberto Baldrati; Francesco Todino; Federico Becattini; Marco Bertini; Alberto Del Bimbo

arXiv:2307.14063 · cs.CV · July 27, 2023

Open Access

TL;DR

This paper introduces ECO, a method that learns an ensemble of diverse textual prompts to enhance vision-language models' few-shot image classification performance without extra inference costs.

Contribution

It proposes a prompt ensembling technique that improves few-shot classification by learning multiple diverse, and possibly shorter, textual contexts instead of a single trainable prompt.

Findings

01 Improved few-shot classification accuracy across 11 benchmarks.

02 Ensembling diverse prompts outperforms single-prompt approaches.

03 The ensemble incurs no additional inference cost.

Abstract

Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves the results considerably and consistently compared with relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
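The abstract's claim of no additional inference cost can be illustrated with a minimal sketch. It assumes the ensemble is collapsed by averaging the per-prompt text embeddings of each class into a single classifier weight before inference; the abstract does not spell out this mechanism, so treat it as an assumption. The function names (`ensemble_class_weights`, `classify`) and the random arrays standing in for CLIP features are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def ensemble_class_weights(prompt_features):
    """Collapse an ensemble of prompts into one weight vector per class.

    prompt_features: array of shape (n_prompts, n_classes, dim), standing in
    for text-encoder outputs of each learned prompt paired with each class.
    Assumption: the ensemble is combined by averaging normalized embeddings.
    """
    per_prompt = l2_normalize(prompt_features)   # normalize each embedding
    mean = per_prompt.mean(axis=0)               # average over the ensemble
    return l2_normalize(mean)                    # shape (n_classes, dim)

def classify(image_features, class_weights):
    """Cosine-similarity classification against precomputed class weights."""
    logits = l2_normalize(image_features) @ class_weights.T
    return logits.argmax(axis=1)

# Placeholder features; a real setup would use CLIP's text and image encoders.
rng = np.random.default_rng(0)
n_prompts, n_classes, dim = 4, 3, 8
prompt_features = rng.normal(size=(n_prompts, n_classes, dim))
image_features = rng.normal(size=(5, dim))

weights = ensemble_class_weights(prompt_features)  # computed once, offline
predictions = classify(image_features, weights)    # one matmul per batch
```

Under this assumption the class weights have the same shape regardless of how many prompts are in the ensemble, so the per-image cost at inference matches that of a single prompt.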

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Focus · Contrastive Language-Image Pre-training