Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification

Dipam Goswami; Simone Magistri; Gido M. van de Ven; Bartłomiej Twardowski; Andrew D. Bagdanov; Tinne Tuytelaars; Joost van de Weijer

arXiv:2603.24528·cs.CV·March 26, 2026

Open Access

TL;DR

This paper improves training-free, CLIP-based few-shot classification by mixing image and text prototypes, projecting image prototypes onto a semantic text space, and modeling class covariances.

Contribution

The paper proposes a method that enhances CLIP-based few-shot classification by mixing image and text prototypes and aligning image features with text semantics, supported by a bias-variance analysis and class covariance modeling.

Findings

01. Mixed prototypes act as shrinkage estimators.

02. Text-aligned image prototypes improve classification.

03. Combining text-aligned and covariance-based classifiers outperforms existing methods.
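
The first finding can be illustrated with a minimal sketch. It assumes that mixing means a convex combination of the L2-normalized text prototype and the per-class mean of the few-shot image embeddings, so the noisy sample mean is shrunk toward a fixed target. The function names, the weight `alpha`, and the normalization steps are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def mixed_prototypes(image_feats, labels, text_protos, alpha=0.5):
    """Mix text and image prototypes (hypothetical helper, not the paper's code).

    image_feats: (N, D) few-shot image embeddings
    labels:      (N,) integer class ids in [0, C); every class needs >= 1 shot
    text_protos: (C, D) class text embeddings
    alpha:       assumed mixing weight; alpha=1 keeps only the text prototype
    """
    num_classes = text_protos.shape[0]
    # Per-class sample mean of the image embeddings (the high-variance estimate).
    image_protos = np.stack(
        [image_feats[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    # Convex combination: the image mean is shrunk toward the text prototype.
    mixed = alpha * l2_normalize(text_protos) + (1.0 - alpha) * l2_normalize(image_protos)
    return l2_normalize(mixed)

def classify(query_feats, prototypes):
    """Assign each query to the prototype with the highest cosine similarity."""
    return np.argmax(l2_normalize(query_feats) @ prototypes.T, axis=1)
```

With few shots per class the image mean is noisy, so a larger `alpha` trades variance for the bias of the text prototype. How the weight should be chosen is not stated in this summary.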

Abstract

Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding…
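
The projection step mentioned at the end of the abstract can be sketched as follows. The abstract is truncated here, so the details are assumptions: the sketch takes the principal directions to be the top-`k` right singular vectors of the mean-centered class text embeddings, and projects each image prototype onto the affine subspace they span. The function names, the centering, and the choice of `k` are hypothetical.

```python
import numpy as np

def text_principal_directions(text_embeds, k):
    """Return the mean and top-k principal directions of the text embeddings.

    text_embeds: (C, D) class text embeddings
    k:           number of directions to keep (assumed hyperparameter)
    """
    mean = text_embeds.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(text_embeds - mean, full_matrices=False)
    return mean, vt[:k]

def project_onto_text_subspace(image_protos, mean, directions):
    """Keep only the part of each image prototype lying in the text subspace.

    image_protos: (C, D) image prototypes
    mean:         (1, D) mean of the text embeddings
    directions:   (k, D) orthonormal principal directions
    """
    coords = (image_protos - mean) @ directions.T  # (C, k) subspace coordinates
    return mean + coords @ directions              # back to the D-dim space
```

Components of an image prototype that are orthogonal to the text subspace, for example background or context variation, are discarded by this projection, which matches the motivation given in the abstract.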

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling