Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Noel E. O'Connor

TL;DR
This paper introduces a framework that uses frozen unimodal encoders for multimodal alignment, achieving competitive zero-shot performance with significantly less data and compute, thus improving accessibility and flexibility in multimodal model development.
Contribution
The authors propose a novel method to align vision and language using frozen unimodal encoders, reducing data and compute needs compared to traditional multimodal training.
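To make the idea concrete, here is a minimal NumPy sketch of two small MLP projectors that map features from frozen encoders into a shared space, scored with a CLIP-style symmetric contrastive loss. The loss, the layer sizes, the temperature, and all function names are assumptions for illustration; this summary does not state the authors' exact objective or architecture. Random arrays stand in for precomputed encoder outputs, and only the forward pass is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden_dim, out_dim):
    """Create weights for a two-layer MLP projector (the only trainable part)."""
    return {
        "W1": rng.normal(0.0, in_dim ** -0.5, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, hidden_dim ** -0.5, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def project(params, x):
    """Map frozen-encoder features into the shared space and L2-normalise rows."""
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU
    z = h @ params["W2"] + params["b2"]
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(img_z, txt_z, temperature=0.07):
    """CLIP-style loss: matched image/caption pairs sit on the diagonal."""
    logits = img_z @ txt_z.T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Stand-ins for frozen encoder outputs; the dimensions are hypothetical.
image_feats = rng.normal(size=(8, 1024))
text_feats = rng.normal(size=(8, 768))

img_proj = init_mlp(1024, 512, 256)
txt_proj = init_mlp(768, 512, 256)

loss = symmetric_contrastive_loss(project(img_proj, image_feats),
                                  project(txt_proj, text_feats))
print(f"contrastive loss on random features: {loss:.3f}")
```

Because the encoders stay frozen, their features can be computed once and cached, so a training loop would only update the projector weights.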
Findings
Achieves 76% zero-shot accuracy on ImageNet with the best model (DINOv2 vision encoder, All-Roberta-Large text encoder).
Reduces data requirements 20-fold and compute 65-fold relative to traditional multimodal training.
Enables flexible multimodal alignment without training from scratch.
Abstract
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications. However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: Is there a plausible way to connect unimodal backbones for vision-language tasks? To this end, we propose a novel framework that aligns vision and language using frozen unimodal encoders. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on 12 zero-shot classification datasets and 2 image-text retrieval datasets. Our best model, utilizing DINOv2 and All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a…
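The encoder-selection step needs some way to score how similar two encoders' latent spaces are. The abstract as shown here does not name the measure, so the sketch below uses linear centered kernel alignment (CKA) as an assumed stand-in; the function name and the random test data are illustrative only. Rows of the two matrices are assumed to be embeddings of the same paired items (an image and its caption).

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two feature matrices with matching rows.

    x: (n_samples, d1), y: (n_samples, d2).
    Returns a score in [0, 1]; higher means more similar geometry.
    """
    x = x - x.mean(axis=0, keepdims=True)  # centre each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(200, 16))

# A rotated and rescaled copy has the same geometry as the original.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated = 3.0 * feats_a @ q

# Independent noise shares no structure with feats_a.
unrelated = rng.normal(size=(200, 32))

print("same geometry:", linear_cka(feats_a, rotated))
print("unrelated:    ", linear_cka(feats_a, unrelated))
```

Linear CKA is invariant to rotation and uniform scaling, so the first score is 1 up to floating-point error, while unrelated features score much lower. Under this assumption, a vision-text encoder pair with a higher score on paired data would be the more promising candidate for alignment with simple projectors.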
Taxonomy
Topics: Language, Metaphor, and Cognition · linguistics and terminology studies
Methods: Contrastive Language-Image Pre-training
