Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment

Mayug Maniparambil; Raiymbek Akshulakov; Yasser Abdelaziz Dahou Djilali; Sanath Narayan; Ankit Singh; Noel E. O'Connor

arXiv:2409.19425 · cs.CV · March 25, 2025

TL;DR

This paper introduces a framework that uses frozen unimodal encoders for multimodal alignment, achieving competitive zero-shot performance with significantly less data and compute, thus improving accessibility and flexibility in multimodal model development.

Contribution

The authors propose a novel method to align vision and language using frozen unimodal encoders, reducing data and compute needs compared to traditional multimodal training.

Findings

01. The best model, pairing DINOv2 with the All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet.

02. Reduces data requirements 20-fold and compute 65-fold relative to traditional multimodal training.

03. Enables flexible multimodal alignment without training from scratch.

Abstract

Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications. However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: Is there a plausible way to connect unimodal backbones for vision-language tasks? To this end, we propose a novel framework that aligns vision and language using frozen unimodal encoders. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on 12 zero-shot classification datasets and 2 image-text retrieval datasets. Our best model, utilizing DINOv2 and All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a…
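The projector-training step named in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes a CLIP-style symmetric contrastive objective over L2-normalised projections; the abstract does not state the loss, so that choice, the temperature of 0.07, the two-layer ReLU projector shape, and all function names (`mlp_project`, `cross_entropy_on_diagonal`, `symmetric_contrastive_loss`) are illustrative assumptions rather than the authors' implementation. The toy feature vectors stand in for outputs of the frozen encoders.

```python
import math


def mlp_project(x, w1, w2):
    """Hypothetical two-layer projector: linear -> ReLU -> linear -> L2-normalise.

    w1 and w2 are lists of weight rows, one row per output unit.
    """
    hidden = [max(0.0, sum(xi * wi for xi, wi in zip(x, row))) for row in w1]
    out = [sum(hi * wi for hi, wi in zip(hidden, row)) for row in w2]
    norm = math.sqrt(sum(o * o for o in out)) or 1.0  # guard against a zero vector
    return [o / norm for o in out]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Assumed CLIP-style loss: average of image-to-text and text-to-image terms."""
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed)
    )


# Toy stand-ins for frozen encoder outputs; only the projector weights
# (here fixed to the identity for illustration) would be trained.
identity = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[3.0, 4.0], [4.0, 3.0]]
text_features = [[3.0, 4.0], [4.0, 3.0]]
image_emb = [mlp_project(f, identity, identity) for f in image_features]
text_emb = [mlp_project(f, identity, identity) for f in text_features]
loss = symmetric_contrastive_loss(image_emb, text_emb)
```

In a real setup the encoders would run without gradient updates and only the projector weights would be optimised against this loss; a framework such as PyTorch would replace the hand-written arithmetic.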

Code & Models

Repositories

mayug/freeze-align
pytorch

Datasets

mayug/concept_coverage_laion_6m
dataset · 34 dl

Taxonomy

Topics: Language, Metaphor, and Cognition · linguistics and terminology studies

Methods: Contrastive Language-Image Pre-training