Learn to Combine Modalities in Multimodal Deep Learning
Kuan Liu, Yanen Li, Ning Xu, Prem Natarajan

TL;DR
This paper introduces a novel deep learning method that multiplicatively combines multiple modalities to enhance classification accuracy by focusing on more reliable sources and capturing cross-modal correlations.
Contribution
The paper proposes a multiplicative fusion technique for multimodal data that automatically emphasizes the more reliable modalities and models cross-modal interactions, improving classification accuracy.
Findings
Consistent accuracy improvements across three multimodal classification tasks.
Effective filtering of noise and conflicts between modalities.
Enhanced modeling of cross-modal signal correlations.
Abstract
Combining complementary information from multiple modalities is intuitively appealing for improving the performance of learning-based approaches. However, it is challenging to fully leverage different modalities due to practical challenges such as varying levels of noise and conflicts between modalities. Existing methods do not adopt a joint approach to capturing synergies between the modalities while simultaneously filtering noise and resolving conflicts on a per-sample basis. In this work we propose a novel deep-neural-network-based technique that multiplicatively combines information from different source modalities. Thus the model training process automatically focuses on information from more reliable modalities while reducing emphasis on the less reliable modalities. Furthermore, we propose an extension that multiplicatively combines not only the single-source modalities, but a…
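The abstract does not state the loss function, so the following is only a minimal sketch of how a multiplicative combination could shift emphasis toward reliable modalities on a per-sample basis. It assumes each of M modalities produces its own probability for the true class, and that each modality's cross-entropy term is scaled by a factor built from the other modalities' probabilities. The function name `multiplicative_fusion_loss`, the exponent `beta`, and the exact weighting form are assumptions made for illustration, not details confirmed by this page.

```python
import numpy as np


def multiplicative_fusion_loss(probs_correct, beta=1.0, eps=1e-12):
    """Illustrative multiplicative combination of per-modality losses.

    probs_correct: sequence of length M, where entry i is modality i's
        predicted probability for the true class of ONE sample.
    beta: hypothetical strength of the down-weighting (0 disables it).

    Returns (total_loss, weights), where weights[i] scales modality i's
    cross-entropy term. The weighting form is an assumption:
        weights[i] = (prod_{j != i} (1 - p_j)) ** (beta / (M - 1))
    """
    p = np.clip(np.asarray(probs_correct, dtype=float), eps, 1.0 - eps)
    m = p.shape[0]
    if m < 2:
        raise ValueError("need at least two modalities to combine")

    # log(1 - p_j) for every modality, computed stably.
    log_one_minus = np.log1p(-p)
    # For each i, sum over j != i of log(1 - p_j).
    others = log_one_minus.sum() - log_one_minus
    # If the other modalities are already confident, this weight shrinks,
    # so an unreliable modality is penalized less for this sample.
    weights = np.exp(beta / (m - 1) * others)

    per_modality = -weights * np.log(p)
    return per_modality.sum(), weights


if __name__ == "__main__":
    # Modality 0 is confident and correct; modality 1 is uninformative.
    loss, weights = multiplicative_fusion_loss([0.9, 0.5], beta=1.0)
    print("weights:", weights)
    print("loss:", loss)
```

For two modalities with true-class probabilities 0.9 and 0.5 and `beta=1.0`, the weights are 0.5 and 0.1. The uninformative modality's loss term is scaled down strongly because the other modality already explains the sample. Setting `beta=0` makes every weight 1 and recovers a plain sum of per-modality cross-entropies.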
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
