Understanding the Emergence of Multimodal Representation Alignment

Megan Tjandrasuwita; Chanakya Ekbote; Liu Ziyin; Paul Pu Liang

arXiv:2502.16282 · cs.LG · June 16, 2025

TL;DR

This paper investigates how and when implicit alignment of multimodal representations emerges in models trained independently, revealing that alignment depends on data characteristics and may not always correlate with improved task performance.

Contribution

It provides a comprehensive empirical analysis of the conditions under which multimodal representation alignment emerges and its relationship with task performance.

Findings

1. Alignment emergence depends on modality similarity and data redundancy.
2. Alignment's impact on performance varies across datasets and tasks.
3. Implicit alignment may not always indicate better task outcomes.

Abstract

Multimodal representation learning is fundamentally about transforming incomparable modalities into comparable representations. While prior research primarily focused on explicitly aligning these representations through targeted learning objectives and model architectures, a recent line of work has found that independently trained unimodal models of increasing scale and performance can become implicitly aligned with each other. These findings raise fundamental questions regarding the emergence of aligned representations in multimodal learning. Specifically: (1) when and why does alignment emerge implicitly? and (2) is alignment a reliable indicator of performance? Through a comprehensive empirical investigation, we demonstrate that both the emergence of alignment and its relationship with task performance depend on several critical data characteristics. These include, but are not…
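The abstract does not say how alignment between two independently trained models is scored. As a minimal sketch, assuming a mutual k-nearest-neighbor overlap score (one common choice, not necessarily the metric used in the paper), alignment can be estimated by checking how often the same sample has the same neighbors in both feature spaces. The function names `knn_indices` and `mutual_knn_alignment` and the synthetic data are hypothetical and serve only as illustration.

```python
import math
import random


def knn_indices(features, k):
    """For each row, return the set of indices of its k nearest other rows (Euclidean)."""
    neighbors = []
    for i, x in enumerate(features):
        dists = [(math.dist(x, y), j) for j, y in enumerate(features) if j != i]
        dists.sort()
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k=5):
    """Average fraction of shared k-nearest neighbors across two feature spaces.

    feats_a[i] and feats_b[i] must describe the same underlying sample i,
    e.g. an image embedding and the embedding of its caption.
    Returns a value in [0, 1]; 1 means identical neighborhoods.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature sets must cover the same samples")
    if not 0 < k < len(feats_a):
        raise ValueError("k must be between 1 and the number of samples minus 1")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


if __name__ == "__main__":
    # Synthetic stand-ins for unimodal embeddings (an assumption, not paper data).
    rng = random.Random(0)
    latent = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
    rescaled = [[2.0 * v for v in row] for row in latent]  # same geometry, new scale
    unrelated = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
    print("shared structure:", mutual_knn_alignment(latent, rescaled))
    print("unrelated:", mutual_knn_alignment(latent, unrelated))
```

The score depends only on neighborhood structure, so it is unaffected by the scale of either embedding space. That makes it usable on models that were never trained against a shared objective, although it says nothing by itself about downstream task performance.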

Code & Models

Repositories

megantj/multimodal_alignment (PyTorch, official)

Videos

Understanding the Emergence of Multimodal Representation Alignment · SlidesLive

Taxonomy

Topics: Language, Metaphor, and Cognition