Canonicalizing Multimodal Contrastive Representation Learning

Sharut Gupta; Sanyam Kansal; Stefanie Jegelka; Phillip Isola; Vikas Garg

arXiv:2602.17584·cs.LG·February 20, 2026

TL;DR

This paper shows that the embedding spaces of independently trained multimodal contrastive models such as CLIP, SigLIP, and FLAVA are well approximated as related by an orthogonal transformation. A single such map aligns both the image and the text embeddings of two models, with implications for model compatibility and privacy.

Contribution

The work shows that the embedding spaces of different multimodal contrastive models are related by an orthogonal map, and supports this claim with both a theoretical proof and empirical validation.

Findings

01 Embedding spaces are related by an orthogonal map Q.

02 The same Q aligns both image and text encoders.

03 The relationship holds across different model architectures and training distributions.
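Findings 01 and 02 can be illustrated on synthetic data. The sketch below is not the authors' procedure: it uses the standard orthogonal Procrustes solution (via SVD) as one plausible way to estimate Q, and random unit vectors stand in for real model embeddings. The names `fit_orthogonal_map`, `unit_rows`, `img_a`, `txt_a`, `img_b`, and `txt_b` are hypothetical. Q is fitted on the image embeddings only and then checked on the text embeddings.

```python
import numpy as np

def fit_orthogonal_map(X, Y):
    """Orthogonal Procrustes: Q minimizing ||X Q - Y||_F subject to Q^T Q = I."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def unit_rows(A):
    """Normalize each row to unit length, as contrastive models typically do."""
    return A / np.linalg.norm(A, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n = 16, 200  # assumed embedding dimension and number of paired samples

# Synthetic stand-ins for model A's image and text embeddings.
img_a = unit_rows(rng.normal(size=(n, d)))
txt_a = unit_rows(rng.normal(size=(n, d)))

# Hypothetical model B: model A's geometry under a hidden orthogonal map, plus small noise.
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
img_b = unit_rows(img_a @ Q_true + 0.01 * rng.normal(size=(n, d)))
txt_b = unit_rows(txt_a @ Q_true + 0.01 * rng.normal(size=(n, d)))

# Fit Q from image pairs only.
Q = fit_orthogonal_map(img_a, img_b)

# Mean cosine similarity between mapped A embeddings and B embeddings.
img_cos = np.sum((img_a @ Q) * img_b, axis=1).mean()
txt_cos = np.sum((txt_a @ Q) * txt_b, axis=1).mean()  # text was never used to fit Q

# An orthogonal map leaves image-text inner products within model A unchanged.
sims_before = img_a @ txt_a.T
sims_after = (img_a @ Q) @ (txt_a @ Q).T

print(f"image alignment cosine: {img_cos:.4f}")
print(f"text alignment cosine:  {txt_cos:.4f}")
print("cross-modal similarities preserved:", np.allclose(sims_before, sims_after))
```

To try this on real models, the synthetic arrays would be replaced with embeddings of the same inputs computed by two models. The sketch assumes both models share an embedding dimension; handling differing dimensions is outside its scope.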

Abstract

As models and data scale, independently trained networks often induce analogous notions of similarity. But matching similarities is weaker than establishing an explicit correspondence between the representation spaces, especially for multimodal models, where consistency must hold not only within each modality, but also for the learned image-text coupling. We therefore ask: given two independently trained multimodal contrastive models (with encoders $(f, g)$ and $(\tilde{f}, \tilde{g})$) -- trained on different distributions and with different architectures -- does a systematic geometric relationship exist between their embedding spaces? If so, what form does it take, and does it hold uniformly across modalities? In this work, we show that across model families such as CLIP, SigLIP, and FLAVA, this geometric relationship is well approximated by an orthogonal map (up to a global…

Taxonomy

Topics: Face recognition and analysis · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis