Canonicalizing Multimodal Contrastive Representation Learning
Sharut Gupta, Sanyam Kansal, Stefanie Jegelka, Phillip Isola, Vikas Garg

TL;DR
This paper shows that independently trained multimodal contrastive models such as CLIP and FLAVA are related by an approximately orthogonal transformation. That relationship makes it possible to align their embedding spaces, with implications for model compatibility and privacy.
Contribution
The work reveals a universal orthogonal relationship between the embedding spaces of different multimodal contrastive models, supported by a theoretical proof and empirical validation.
Findings
Embedding spaces are related by an orthogonal map Q.
The same Q aligns both image and text encoders.
The relationship holds across different model architectures and training distributions.
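The findings above can be illustrated with a small sketch. It fits an orthogonal map Q from paired image embeddings only, then checks whether the same Q also aligns the text embeddings. The fit uses orthogonal Procrustes via SVD, a standard technique for this problem; this summary does not say that it is the authors' procedure. The data are synthetic stand-ins for two models' outputs, and every name here (`fit_orthogonal_map`, `img_a`, `txt_b`, and so on) is hypothetical.

```python
import numpy as np

def fit_orthogonal_map(A, B):
    """Orthogonal Procrustes: find orthogonal Q minimizing ||A @ Q.T - B||_F.

    A, B are (n, d) arrays of paired embeddings, one sample per row.
    """
    M = B.T @ A                    # (d, d) cross-covariance of the pairs
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt                  # closest orthogonal matrix to M

def unit_rows(X):
    """Normalize each row to unit length, as contrastive models usually do."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 200, 16

# Assumption: a ground-truth orthogonal map relating "model A" to "model B".
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Synthetic image and text embeddings from model A.
img_a = unit_rows(rng.normal(size=(n, d)))
txt_a = unit_rows(rng.normal(size=(n, d)))

# Model B sees the same inputs through Q_true (images get a little noise).
img_b = img_a @ Q_true.T + 0.01 * rng.normal(size=(n, d))
txt_b = txt_a @ Q_true.T

# Fit Q on images only, then test it on text embeddings it never saw.
Q = fit_orthogonal_map(img_a, img_b)
orth_err = np.linalg.norm(Q.T @ Q - np.eye(d))
text_err = np.linalg.norm(txt_a @ Q.T - txt_b) / np.linalg.norm(txt_b)

print(f"orthogonality error: {orth_err:.2e}")
print(f"relative text alignment error: {text_err:.2e}")
```

With real models, `img_a` and `img_b` would be the two models' embeddings of the same images. A low text alignment error would then be the empirical signature of the second finding.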
Abstract
As models and data scale, independently trained networks often induce analogous notions of similarity. But matching similarities is weaker than establishing an explicit correspondence between the representation spaces, especially for multimodal models, where consistency must hold not only within each modality but also for the learned image-text coupling. We therefore ask: given two independently trained multimodal contrastive models (each with its own image and text encoders) -- trained on different distributions and with different architectures -- does a systematic geometric relationship exist between their embedding spaces? If so, what form does it take, and does it hold uniformly across modalities? In this work, we show that across model families such as CLIP, SigLIP, and FLAVA, this geometric relationship is well approximated by an orthogonal map (up to a global…
Taxonomy
Topics: Face recognition and analysis · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
