Multimodal Classification via Total Correlation Maximization
Feng Yu, Xiangyu Wu, Yang Yang, Jianfeng Lu

TL;DR
This paper introduces TCMax, a novel method for multimodal classification that maximizes total correlation between features and labels, effectively balancing modalities and capturing inter-modal interactions, leading to improved performance.
Contribution
The paper proposes TCMax, a new hyperparameter-free loss function based on total correlation maximization, with a theoretical foundation and practical effectiveness demonstrated through extensive experiments.
Findings
TCMax outperforms existing methods in multimodal classification.
The approach effectively balances modality contributions and captures inter-modal interactions.
The method is hyperparameter-free and theoretically grounded.
Abstract
Multimodal learning integrates data from diverse sensors to effectively harness information from different modalities. However, recent studies reveal that joint learning often overfits certain modalities while neglecting others, leading to performance inferior to that of unimodal learning. Although previous efforts have sought to balance modal contributions or combine joint and unimodal learning, thereby mitigating the degradation of weaker modalities with promising outcomes, few have examined the relationship between joint and unimodal learning from an information-theoretic perspective. In this paper, we theoretically analyze modality competition and propose a method for multimodal classification by maximizing the total correlation between multimodal features and labels. By maximizing this objective, our approach alleviates modality competition while capturing inter-modal interactions…
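As a point of reference for the objective named in the abstract, total correlation among two feature variables and a label is, by the standard definition, $TC(Z_1, Z_2, Y) = H(Z_1) + H(Z_2) + H(Y) - H(Z_1, Z_2, Y)$. The sketch below evaluates that quantity exactly for a small discrete joint distribution. It is only an illustration of the definition: the function names `entropy` and `total_correlation` and the toy distributions are assumptions made here, and this is not the paper's neural estimator or the TCMax loss.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # terms with zero probability contribute nothing
    return float(-(p * np.log(p)).sum())


def total_correlation(joint):
    """Sum of marginal entropies minus joint entropy for an n-D joint pmf."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()  # normalise defensively
    axes = range(joint.ndim)
    marginals = [
        joint.sum(axis=tuple(a for a in axes if a != k)) for k in axes
    ]
    return sum(entropy(m) for m in marginals) - entropy(joint)


# Toy case 1 (hypothetical): Z1, Z2, Y mutually independent, so TC is 0.
indep = np.einsum(
    "i,j,k->ijk",
    np.array([0.5, 0.5]),
    np.array([0.3, 0.7]),
    np.array([0.4, 0.6]),
)
tc_indep = total_correlation(indep)

# Toy case 2 (hypothetical): Z1 = Z2 = Y, a fair coin, so TC = 2 * log(2).
copy = np.zeros((2, 2, 2))
copy[0, 0, 0] = 0.5
copy[1, 1, 1] = 0.5
tc_copy = total_correlation(copy)
```

The two toy cases bracket the behaviour: total correlation vanishes when features and label share no information and grows as they become redundant with one another.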
Peer Reviews
Decision: ICLR 2026 Poster
1. Strong theoretical foundation: The paper provides a clear and rigorous information-theoretic formulation for learning from multimodal inputs.
2. Novel objective function: The paper introduces Total Correlation Neural Estimation (TCNE) and the TCMax loss.
3. Conceptual clarity: The theoretical motivation is well grounded and is easily applied in practice.
4. Empirical validation: The experimental results consistently demonstrate performance improvements over multimodal and unimodal baselin
1. **Vague definition of weak and strong modalities.** Although the paper discusses modality competition, the criteria used to define or quantify “weak” versus “strong” modalities are not clearly specified. Providing a more explicit operational definition or empirical measure would strengthen the theoretical analysis and clarify the interpretation of the results. Additional experiments that explicitly quantify these distinctions would further support the claims.
2. **The claim that the analysis
- The paper is well written and clearly explains the motivation for the method from an information-theoretic perspective.
- The mathematical analysis is easy to follow, and the proposed loss is simple yet effective.
- The evaluation of the proposed method against baselines is fair and thorough, spanning different domains and modalities (images, audio, videos).
- The results are strong in most cases and clearly confirm the hypothesis made by the authors regarding the modeling of cross-mo
- While I appreciate the completeness in the evaluation protocol in Table 2, I think it neglects the recent emergence of foundation models in the field of unimodal or multimodal representation learning (such as CLIP for vision and language, DINOv3 for vision, Wav2vec 2.0 for speech). I think it is also important to consider these pre-trained models as feature extractors and to apply your method on top of these, as you did afterwards in Table 4 for another dataset. It would clearly demonstrate th
The paper gives a sound proof of the lower bound on TC and an interesting idea: comparing the expectation under the joint multimodal distribution with the expectations under the unimodal ones. The problem and the solution are well motivated.
There is one major issue. Judging from the code in the anonymized repo, the method appears to use the test set to select the best model during training. This constitutes data leakage, since the validation and test sets are the same in this case. Given this major issue, I recommend rejecting the paper. Despite that, I will provide some more input on the rest of the paper, since the idea is well presented. The decomposition $I(Z_1;Y) + I(Z_2;Y) + I(Z_1; Z_2 | Y) = TC$ ignores that $Z_1, Z_2$ coul
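For reference, the identity quoted in this review follows from the entropy definition of total correlation. The short derivation below uses only standard Shannon identities and the review's symbols $Z_1$, $Z_2$, $Y$; it is a sketch added for orientation and is not taken from the paper's own proof.

```latex
\begin{aligned}
TC(Z_1, Z_2, Y) &= H(Z_1) + H(Z_2) + H(Y) - H(Z_1, Z_2, Y), \\
I(Z_1; Y) &= H(Z_1) + H(Y) - H(Z_1, Y), \\
I(Z_2; Y) &= H(Z_2) + H(Y) - H(Z_2, Y), \\
I(Z_1; Z_2 \mid Y) &= H(Z_1, Y) + H(Z_2, Y) - H(Y) - H(Z_1, Z_2, Y), \\
I(Z_1; Y) + I(Z_2; Y) + I(Z_1; Z_2 \mid Y)
  &= H(Z_1) + H(Z_2) + H(Y) - H(Z_1, Z_2, Y) \\
  &= TC(Z_1, Z_2, Y).
\end{aligned}
```

The pairwise entropies $H(Z_1, Y)$ and $H(Z_2, Y)$ cancel when the three terms are summed, which is why the unimodal terms and the conditional interaction term together recover the full total correlation.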
Taxonomy
Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Face and Expression Recognition
