Dense Multimodal Fusion for Hierarchically Joint Representation
Di Hu, Feiping Nie, Xuelong Li

TL;DR
This paper introduces Dense Multimodal Fusion (DMF), a hierarchical feature integration method that stacks shared layers between modality-specific networks, capturing correlations at multiple levels for improved multimodal learning.
Contribution
The paper proposes a dense fusion approach that captures hierarchical correlations across modalities; the authors report faster convergence and better performance on audiovisual speech recognition and cross-modal retrieval.
Findings
Improved performance on audiovisual speech recognition
Enhanced cross-modal retrieval accuracy
Lower training loss and faster convergence
Abstract
Multiple modalities can provide more valuable information than a single one by describing the same content in various ways. Hence, it is highly desirable to learn an effective joint representation by fusing the features of different modalities. However, previous methods mainly focus on fusing the shallow features or high-level representations generated by unimodal deep networks, which capture only part of the hierarchical correlations across modalities. In this paper, we propose to densely integrate the representations by greedily stacking multiple shared layers between different modality-specific networks, an approach we name Dense Multimodal Fusion (DMF). The joint representations in different shared layers can capture the correlations at different levels, and the connection between shared layers also provides an efficient way to learn the dependence among hierarchical correlations. These…
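The stacking idea in the abstract can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the sigmoid activation, the additive combination of inputs, the layer sizes, and every name here (`init_params`, `dense_fusion_forward`, `W_a`, `W_v`, `U_a`, `U_v`, `U_s`) are assumptions made for illustration only. It shows two modality-specific stacks (audio and visual) with one shared layer per level, where each shared layer also receives the shared layer below it. The greedy layer-wise training mentioned in the abstract is not shown.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    # Small random weights; the initialization scheme is an assumption.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vec):
    # Matrix-vector product: one output value per weight row.
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def add(*vecs):
    # Element-wise sum of equally sized vectors.
    return [sum(vals) for vals in zip(*vecs)]


def sigmoid_vec(vec):
    return [1.0 / (1.0 + math.exp(-x)) for x in vec]


def init_params(audio_dim, visual_dim, hidden_dims, shared_dims, seed=0):
    # One parameter set per level: modality-specific weights (W_*),
    # modality-to-shared weights (U_a, U_v), and shared-to-shared weights (U_s).
    rng = random.Random(seed)
    params = []
    a_in, v_in, s_in = audio_dim, visual_dim, None
    for hid, sh in zip(hidden_dims, shared_dims):
        level = {
            "W_a": rand_matrix(hid, a_in, rng),
            "W_v": rand_matrix(hid, v_in, rng),
            "U_a": rand_matrix(sh, hid, rng),
            "U_v": rand_matrix(sh, hid, rng),
        }
        if s_in is not None:
            level["U_s"] = rand_matrix(sh, s_in, rng)
        params.append(level)
        a_in, v_in, s_in = hid, hid, sh
    return params


def dense_fusion_forward(audio, visual, params):
    # Returns the joint (shared) representation computed at every level.
    h_a, h_v, shared = audio, visual, None
    joint_per_level = []
    for level in params:
        # Modality-specific hidden layers at this level.
        h_a = sigmoid_vec(linear(level["W_a"], h_a))
        h_v = sigmoid_vec(linear(level["W_v"], h_v))
        # Shared layer fuses both modalities at this level...
        pre = add(linear(level["U_a"], h_a), linear(level["U_v"], h_v))
        # ...and is connected to the shared layer of the level below.
        if shared is not None:
            pre = add(pre, linear(level["U_s"], shared))
        shared = sigmoid_vec(pre)
        joint_per_level.append(shared)
    return joint_per_level


params = init_params(audio_dim=6, visual_dim=4, hidden_dims=[5, 3], shared_dims=[4, 2])
joints = dense_fusion_forward([0.1] * 6, [0.2] * 4, params)
```

Here `joints[0]` is the joint representation fused from the first-level features and `joints[1]` is the one fused from the second-level features plus `joints[0]`, which is one way to read "shared layers at different levels" and "connection between shared layers".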
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
