Countering Multi-modal Representation Collapse through Rank-targeted Fusion

Seulgi Kim; Kiran Kokilepersaud; Mohit Prabhushankar; Ghassan AlRegib

arXiv:2511.06450·cs.CV·February 24, 2026

Open Access

TL;DR

This paper introduces a rank-targeted fusion framework that effectively counters feature and modality collapse in multi-modal data, improving performance in human action anticipation tasks.

Contribution

It proposes a theoretically grounded method that enhances the effective rank of fused representations, addressing both feature and modality collapse simultaneously.

Findings

01

Significantly outperforms prior methods by up to 3.74% on benchmark datasets.

02

Increases the effective rank of the fused features.

03

Depth maintains representational balance when fused with RGB.

Abstract

Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse, where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse, where one dominant modality overwhelms the other. Applications such as human action anticipation, which require fusing multifarious sensor data, are hindered by both. However, existing methods attempt to counter feature collapse and modality collapse separately, because there is no unifying framework that efficiently addresses the two in conjunction. In this paper, we posit the utility of effective rank as an informative measure that can be used to quantify and counter both forms of representation collapse. We propose Rank-enhancing Token Fuser, a theoretically grounded fusion framework that selectively blends less…
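To make the notion of effective rank concrete, the sketch below computes one common definition: the exponential of the Shannon entropy of the normalized singular values of a feature matrix. The excerpt above does not state which definition the paper uses, so this choice, the function name `effective_rank`, and the synthetic matrices are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (n_samples, n_dims) feature matrix.

    Assumed definition (not confirmed by the abstract): exp of the Shannon
    entropy of the singular values after normalizing them to sum to 1.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions so log() is defined
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


# Hypothetical data: one well-spread feature matrix and one collapsed one.
rng = np.random.default_rng(0)
spread = rng.standard_normal((256, 16))  # energy in all 16 dimensions
collapsed = np.outer(rng.standard_normal(256), rng.standard_normal(16))  # rank 1

print("spread:   ", effective_rank(spread))
print("collapsed:", effective_rank(collapsed))
```

Under this definition, a matrix whose energy is spread evenly over all dimensions scores close to its full dimensionality, while a matrix dominated by a single direction scores close to 1, which is why a low value can serve as a signal of collapse.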

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Emotion and Mood Recognition