Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling
David Méndez, Roberto Confalonieri, Natalia Díaz Rodríguez

TL;DR
This paper proposes repurposing pretrained vision model classification heads as semantic prototypes to improve vision-language alignment and cross-modal retrieval without extensive additional training.
Contribution
It introduces a novel method of weight recycling from pretrained models to enhance zero-shot and few-shot vision-language tasks.
Findings
Boosts accuracy in cross-modal retrieval tasks.
Enhances zero- and few-shot classification performance.
Serves as a robust data augmentation strategy when the prototypes are mixed with real image-text pairs.
Abstract
Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc…
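As a rough illustration of the idea described in the abstract, the sketch below treats each row of a vision model's classification head as a pseudo image feature for its class and pairs it with a text embedding of the class name. It then fits a lightweight linear map from image space to text space. This is a minimal sketch under stated assumptions, not the authors' method: the abstract does not specify the mapping, so ridge regression stands in for it, and the function names (`build_pairs`, `fit_linear_map`, `retrieve_text`) and the toy random data are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def build_pairs(head_weights, class_text_emb, real_img=None, real_txt=None):
    """Pair each classification-head row (a class prototype) with the text
    embedding of its class name; optionally append real image-text pairs,
    which corresponds to the augmentation use mentioned in the abstract."""
    X = l2_normalize(head_weights)
    Y = l2_normalize(class_text_emb)
    if real_img is not None and real_txt is not None:
        X = np.vstack([X, l2_normalize(real_img)])
        Y = np.vstack([Y, l2_normalize(real_txt)])
    return X, Y


def fit_linear_map(X, Y, ridge=1e-3):
    """Assumed stand-in for a post-hoc alignment method: ridge regression,
    M = (X^T X + ridge * I)^{-1} X^T Y, mapping image space to text space."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)


def retrieve_text(img_feats, M, text_emb):
    """Map image features into text space and return, for each image,
    the index of the most cosine-similar text embedding."""
    mapped = l2_normalize(img_feats @ M)
    return (mapped @ l2_normalize(text_emb).T).argmax(axis=1)


# Toy data only: 5 classes, 8-dim image features, 6-dim text embeddings.
rng = np.random.default_rng(0)
head_weights = rng.normal(size=(5, 8))      # stands in for the recycled head
class_text_emb = rng.normal(size=(5, 6))    # stands in for class-name embeddings

X, Y = build_pairs(head_weights, class_text_emb)
M = fit_linear_map(X, Y)

# Query with slightly perturbed prototypes, as if they were image features.
queries = head_weights + 0.01 * rng.normal(size=head_weights.shape)
print(retrieve_text(queries, M, class_text_emb))
```

In this sketch no paired image-text data is needed to fit `M`; passing `real_img` and `real_txt` to `build_pairs` mixes real pairs with the prototype pairs instead.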
