A Closer Look at Multimodal Representation Collapse

Abhra Chaudhuri; Anjan Dutta; Tu Bui; Serban Georgescu

arXiv:2505.22483·cs.LG·August 18, 2025

Open Access · 1 Video

TL;DR

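This paper investigates modality collapse in multimodal fusion models. It traces the collapse to noisy features from one modality becoming entangled with predictive features from another in the fusion head, shows that cross-modal knowledge distillation implicitly disentangles these representations, and proposes an algorithm to prevent collapse, validated through extensive experiments.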

Contribution

The paper provides a theoretical account of modality collapse, introduces an algorithm to prevent it, and validates the approach on multiple benchmarks.

Findings

01

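Modality collapse occurs when noisy features from one modality become entangled, through a shared set of neurons in the fusion head, with predictive features from another modality.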

02

Cross-modal knowledge distillation implicitly disentangles these representations by freeing up rank bottlenecks in the student encoder, which mitigates collapse.

03

The proposed algorithm prevents modality collapse across multiple benchmarks.

Abstract

We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit…
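To make the idea of cross-modal knowledge distillation concrete, here is a minimal sketch in Python with NumPy. It is not the algorithm proposed in the paper, whose details are not given on this page. It is a generic feature-matching setup under stated assumptions: both encoders are linear, the teacher (one modality) is frozen, the student (the other modality) is trained by plain gradient descent on a mean squared error between the two representations, and the data is synthetic. The function name `distill_student` and all parameters are hypothetical.

```python
import numpy as np


def distill_student(x_student, teacher_repr, steps=200, lr=0.05, seed=0):
    """Fit a linear student encoder so its outputs match a frozen teacher's.

    x_student:    (n, d_in) inputs from the student's modality.
    teacher_repr: (n, d_out) representations of the paired teacher inputs.
    Returns the learned weights and the loss recorded before each update.
    Illustrative assumption: linear encoders and an MSE matching loss.
    """
    rng = np.random.default_rng(seed)
    d_in = x_student.shape[1]
    d_out = teacher_repr.shape[1]
    weights = rng.normal(scale=0.1, size=(d_in, d_out))
    losses = []
    for _ in range(steps):
        diff = x_student @ weights - teacher_repr
        losses.append(float(np.mean(diff ** 2)))
        # Gradient of mean(diff ** 2) with respect to the student weights.
        grad = 2.0 * x_student.T @ diff / diff.size
        weights -= lr * grad
    return weights, losses


# Synthetic paired data: both modalities are views of a shared latent signal.
rng = np.random.default_rng(1)
latent = rng.normal(size=(256, 3))
x_a = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(256, 8))
x_b = latent @ rng.normal(size=(3, 6))

# Frozen teacher encoder for modality B (random weights stand in for a
# pretrained encoder; this is an assumption made for the example).
teacher_weights = rng.normal(size=(6, 4)) / np.sqrt(6)
teacher_repr = x_b @ teacher_weights

student_weights, losses = distill_student(x_a, teacher_repr)
print(f"distillation loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In this toy setting the student for modality A learns to reproduce the teacher's representation of the paired modality B input. The abstract's claims about rank bottlenecks and denoising concern what such a distillation objective does inside a real fusion model, which this sketch does not attempt to reproduce.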

Videos

A Closer Look at Multimodal Representation Collapse · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Emotion and Mood Recognition