CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning
Chunlei Meng, Guanhong Huang, Rong Fu, Runmin Jian, Zhongxue Gan, Chun Ouyang

TL;DR
This paper introduces CLCR, a multimodal learning framework that organizes each modality's features into a three-level semantic hierarchy and uses level-wise cross-modal interactions to better extract shared and private information, improving performance across diverse tasks.
Contribution
The paper proposes a hierarchical semantic organization and cross-level interaction mechanism for multimodal data, addressing semantic misalignment and error propagation in existing fusion methods.
Findings
Achieves state-of-the-art results on six multimodal benchmarks.
Effectively separates shared and private features across modalities.
Demonstrates strong generalization across various tasks.
Abstract
Multimodal learning aims to capture both shared and private information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. To address this issue, we propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality's features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. Then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts…
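To make the idea of a per-level shared/private factorization concrete, here is a minimal sketch. It is not the paper's IntraCED module: it assumes a fixed, orthonormal shared basis and uses plain orthogonal projection, whereas the paper's factorization is presumably learned. The names `split_shared_private`, `shared_basis`, and the toy `features` dictionary are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]

LEVELS = ("shallow", "mid", "deep")  # the three semantic levels named in the abstract


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def split_shared_private(feature: Vector, shared_basis: List[Vector]) -> Tuple[Vector, Vector]:
    """Split a feature into a shared part and a private part.

    Assumption: shared_basis is an orthonormal basis of the shared subspace.
    The shared part is the orthogonal projection onto that subspace, and the
    private part is the residual, so the two parts sum back to the input.
    """
    shared = [0.0] * len(feature)
    for basis_vec in shared_basis:
        coeff = dot(feature, basis_vec)
        shared = [s + coeff * b for s, b in zip(shared, basis_vec)]
    private = [f - s for f, s in zip(feature, shared)]
    return shared, private


# Hypothetical toy data: one 3-dimensional feature per level and modality.
shared_basis: List[Vector] = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
features: Dict[str, Dict[str, Vector]] = {
    "audio": {"shallow": [1.0, 2.0, 3.0], "mid": [0.5, 0.0, 1.0], "deep": [2.0, 2.0, 0.0]},
    "video": {"shallow": [4.0, 0.0, 1.0], "mid": [1.0, 1.0, 1.0], "deep": [0.0, 3.0, 2.0]},
}

# Factorize every modality at every level independently.
factorized = {
    modality: {level: split_shared_private(levels[level], shared_basis) for level in LEVELS}
    for modality, levels in features.items()
}
```

In this toy setup, each level's shared and private parts are orthogonal by construction, which is one simple way to keep the two kinds of information separate. A learned model would replace the fixed basis with trainable projections.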
Taxonomy
Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Multimodal Machine Learning Applications
