Understanding the Emergence of Multimodal Representation Alignment
Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, Paul Pu Liang

TL;DR
This paper investigates when and why representations from independently trained unimodal models become implicitly aligned. It shows that both the emergence of alignment and its relationship with task performance depend on data characteristics, so higher alignment does not always mean better task performance.
Contribution
The paper provides a comprehensive empirical analysis of the conditions under which multimodal representation alignment emerges and of how that alignment relates to task performance.
Findings
Alignment emergence depends on modality similarity and data redundancy.
Alignment's impact on performance varies across datasets and tasks.
Implicit alignment may not always indicate better task outcomes.
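To make "alignment" in the findings above concrete, here is a minimal sketch of one way to score how similar two independently produced embedding spaces are. It assumes a mutual k-nearest-neighbor overlap metric. The summary does not say which metric the paper uses, so the metric, the function names (`knn_indices`, `mutual_knn_alignment`), and the synthetic data are all illustrative assumptions.

```python
import math
import random


def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, excluding itself.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k=5):
    """Mean fraction of shared k-nearest neighbors between two embedding spaces.

    feats_a[i] and feats_b[i] must describe the same underlying sample i
    (for example an image and its caption). The two spaces may have
    different dimensionality. Returns a value in [0, 1].
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature lists must cover the same samples")
    if not 0 < k < len(feats_a):
        raise ValueError("k must satisfy 0 < k < number of samples")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


if __name__ == "__main__":
    # Hypothetical data: modality A is a 4-d embedding, modality B is either a
    # noisy copy of A (shared information) or unrelated noise in 6 dimensions.
    rng = random.Random(0)
    feats_a = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(60)]
    feats_b_shared = [[x + rng.gauss(0, 0.1) for x in row] for row in feats_a]
    feats_b_noise = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(60)]

    print("shared information:", mutual_knn_alignment(feats_a, feats_b_shared))
    print("unrelated noise:   ", mutual_knn_alignment(feats_a, feats_b_noise))
```

A score near 1 means the two models place the same samples close together. A score near k / (n - 1) is what unrelated embeddings would give by chance. The score says nothing by itself about downstream task performance, which is the distinction the findings draw.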
Abstract
Multimodal representation learning is fundamentally about transforming incomparable modalities into comparable representations. While prior research primarily focused on explicitly aligning these representations through targeted learning objectives and model architectures, a recent line of work has found that independently trained unimodal models of increasing scale and performance can become implicitly aligned with each other. These findings raise fundamental questions regarding the emergence of aligned representations in multimodal learning. Specifically: (1) when and why does alignment emerge implicitly? and (2) is alignment a reliable indicator of performance? Through a comprehensive empirical investigation, we demonstrate that both the emergence of alignment and its relationship with task performance depend on several critical data characteristics. These include, but are not…
Taxonomy
Topics: Language, Metaphor, and Cognition
