Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification
Dipam Goswami, Simone Magistri, Gido M. van de Ven, Bartłomiej Twardowski, Andrew D. Bagdanov, Tinne Tuytelaars, Joost van de Weijer

TL;DR
This paper improves training-free, CLIP-based few-shot classification by mixing image and text prototypes, projecting image prototypes onto the semantic text space, and modeling class covariances.
Contribution
The paper proposes a method that enhances CLIP-based few-shot classification by mixing image and text prototypes and aligning image prototypes with text semantics, supported by a bias-variance analysis and class covariance modeling.
Findings
Mixed prototypes act as shrinkage estimators.
Text-aligned image prototypes improve classification.
Combining text-aligned and covariance-based classifiers outperforms existing methods.
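To make the shrinkage finding concrete, here is a generic textbook-style illustration, not the paper's own derivation. The symbols are assumptions introduced for this sketch: $t$ is a class text prototype, $\bar{x}$ is the mean of $n$ i.i.d. image embeddings with true class mean $\mu$ and covariance $\Sigma$, and $\alpha \in [0, 1]$ is a mixing weight. Feature normalization is ignored.

```latex
\begin{align}
\hat{\mu}_\alpha &= \alpha\, t + (1 - \alpha)\, \bar{x}, \\
\mathbb{E}\,\lVert \hat{\mu}_\alpha - \mu \rVert^2
  &= \underbrace{\alpha^2 \lVert t - \mu \rVert^2}_{\text{squared bias}}
   + \underbrace{(1 - \alpha)^2\, \frac{\operatorname{tr}(\Sigma)}{n}}_{\text{variance}}, \\
\alpha^\star &= \frac{\operatorname{tr}(\Sigma)/n}{\lVert t - \mu \rVert^2 + \operatorname{tr}(\Sigma)/n}.
\end{align}
```

Under these assumptions, the text prototype is a fixed but possibly biased target, the image mean is unbiased but noisy when $n$ is small, and the error-minimizing weight $\alpha^\star$ shifts toward the text prototype as the number of shots decreases.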
Abstract
Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding…
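The two ideas named in the abstract, mixing prototypes and projecting image prototypes onto the principal directions of the text embeddings, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the mixing weight `alpha`, the number of directions `k`, the centering of text embeddings before the SVD, and the cosine-similarity classifier are all hypothetical choices made for the example. The covariance-based classifier is not covered here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def image_prototypes(features, labels, num_classes):
    """Average the few-shot image features of each class (assumed setup)."""
    protos = np.stack(
        [features[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return l2_normalize(protos)


def mix_prototypes(text_protos, image_protos, alpha=0.5):
    """Convex combination of text and image prototypes; alpha is hypothetical."""
    return l2_normalize(alpha * text_protos + (1.0 - alpha) * image_protos)


def text_principal_directions(text_protos, k):
    """Top-k right singular vectors of the centered text embeddings.

    Centering before the SVD is an assumption made for this sketch.
    """
    centered = text_protos - text_protos.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # shape (k, d), rows are orthonormal


def project_onto_text_subspace(image_protos, directions):
    """Keep only the part of each image prototype that lies in the text subspace."""
    return l2_normalize(image_protos @ directions.T @ directions)


def classify(queries, prototypes):
    """Assign each query to the prototype with the highest cosine similarity."""
    return np.argmax(l2_normalize(queries) @ prototypes.T, axis=1)
```

A typical call sequence would compute `image_prototypes` from the support set, align them with `project_onto_text_subspace`, optionally combine them with the text prototypes via `mix_prototypes`, and pass the result to `classify`.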
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
