Countering Multi-modal Representation Collapse through Rank-targeted Fusion
Seulgi Kim, Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib

TL;DR
This paper introduces a rank-targeted fusion framework that effectively counters feature and modality collapse in multi-modal data, improving performance in human action anticipation tasks.
Contribution
It proposes a theoretically grounded method that enhances the effective rank of fused representations, addressing both feature and modality collapse simultaneously.
Findings
Significantly outperforms prior methods by up to 3.74% on benchmark datasets.
Effectively increases the effective rank of fused features.
Maintains representational balance with depth when fused with RGB.
Abstract
Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse where one dominant modality overwhelms the other. Applications like human action anticipation that require fusing multifarious sensor data are hindered by both feature and modality collapse. However, existing methods attempt to counter feature collapse and modality collapse separately. This is because there is no unifying framework that efficiently addresses feature and modality collapse in conjunction. In this paper, we posit the utility of effective rank as an informative measure that can be utilized to quantify and counter both the representation collapses. We propose \textit{Rank-enhancing Token Fuser}, a theoretically grounded fusion framework that selectively blends less…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Pose and Action Recognition · Multimodal Machine Learning Applications · Emotion and Mood Recognition
