What Makes Multi-modal Learning Better than Single (Provably)
Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, Longbo Huang

TL;DR
This paper provides the first theoretical proof that multi-modal learning outperforms single-modal learning in terms of population risk, supported by experiments, under a common fusion framework.
Contribution
It offers a novel theoretical analysis demonstrating that multi-modal learning has a smaller population risk than uni-modal learning, explaining observed empirical advantages.
Findings
Learning with multiple modalities achieves a lower population risk than learning with only a subset of those modalities.
Theoretical justification for multi-modal superiority is established.
Experimental results support the theoretical claims.
Abstract
The world provides us with data of multiple modalities. Intuitively, models fusing data from different modalities outperform their uni-modal counterparts, since more information is aggregated. Recently, joining the success of deep learning, there is an influential line of work on deep multi-modal learning, which has remarkable empirical results on various applications. However, theoretical justifications in this field are notably lacking. Can multi-modal learning provably perform better than uni-modal? In this paper, we answer this question under a most popular multi-modal fusion framework, which firstly encodes features from different modalities into a common latent space and seamlessly maps the latent representations into the task space. We prove that learning with multiple modalities achieves a smaller population risk than only using its subset of modalities. The main intuition…
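The fusion framework described in the abstract (per-modality encoders into a common latent space, followed by a map into the task space) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the linear encoders, the summation used to fuse latent codes, the feature dimensions, and the names `make_linear_encoder` and `fuse_and_predict`. The sketch shows only the structure of the pipeline, not the risk comparison the paper proves.

```python
import numpy as np

def make_linear_encoder(in_dim, latent_dim, rng):
    # Hypothetical encoder: a random linear map into the shared latent space.
    W = rng.normal(size=(in_dim, latent_dim)) / np.sqrt(in_dim)
    return lambda x: x @ W

def fuse_and_predict(inputs, encoders, head_weights):
    # Encode each modality, fuse by summation (an assumption of this sketch),
    # then map the fused latent representation into the task space.
    latent = sum(enc(x) for enc, x in zip(encoders, inputs))
    return latent @ head_weights

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))    # 4 samples, 8 illustrative "audio" features
video = rng.normal(size=(4, 16))   # 4 samples, 16 illustrative "video" features

latent_dim = 5
encoders = [
    make_linear_encoder(8, latent_dim, rng),
    make_linear_encoder(16, latent_dim, rng),
]
head = rng.normal(size=(latent_dim, 3))  # task space with 3 outputs

# Same pipeline applied to both modalities and to a single-modality subset.
multi_modal_out = fuse_and_predict([audio, video], encoders, head)
uni_modal_out = fuse_and_predict([audio], encoders[:1], head)
```

Both calls return an array with one row per sample and one column per task output; the uni-modal call simply drops one encoder from the same pipeline, which mirrors the "subset of modalities" setting in the abstract.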
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
