Dense Multimodal Fusion for Hierarchically Joint Representation

Di Hu; Feiping Nie; Xuelong Li

arXiv:1810.03414 · cs.CV · October 9, 2018 · 1 citation

TL;DR

This paper introduces Dense Multimodal Fusion (DMF), a hierarchical feature integration method that stacks shared layers between modality-specific networks, capturing correlations at multiple levels for improved multimodal learning.

Contribution

The paper proposes a dense fusion approach that captures hierarchical correlations across modalities; the authors report faster convergence and better performance on audiovisual speech recognition and cross-modal retrieval.

Findings

01. Improved performance on audiovisual speech recognition

02. Enhanced cross-modal retrieval accuracy

03. Lower training loss and faster convergence

Abstract

Multiple modalities can provide more valuable information than a single one by describing the same content in different ways. Hence, it is highly desirable to learn an effective joint representation by fusing the features of different modalities. However, previous methods mainly focus on fusing either the shallow features or the high-level representations generated by unimodal deep networks, which capture only part of the hierarchical correlations across modalities. In this paper, we propose to densely integrate the representations by greedily stacking multiple shared layers between different modality-specific networks, an approach we name Dense Multimodal Fusion (DMF). The joint representations in different shared layers can capture correlations at different levels, and the connections between shared layers also provide an efficient way to learn the dependence among hierarchical correlations. These…
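The stacking idea described in the abstract can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the wiring (each shared layer reads both modality-specific layers at its level plus the previous shared layer), the tanh activation, the random initialization, and all names and dimensions (`build_dmf`, `dmf_forward`, `audio_dim`, `visual_dim`, `hidden_dim`, `shared_dim`) are assumptions chosen for illustration, and no training is shown.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Random weights (one row per output unit) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def tanh_layer(x, weights, bias):
    """Affine map followed by tanh (activation choice is an assumption)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def build_dmf(rng, audio_dim, visual_dim, hidden_dim, shared_dim, depth):
    """Hypothetical two-modality network with one shared layer per level."""
    params = {"audio": [], "visual": [], "shared": []}
    a_in, v_in, s_prev = audio_dim, visual_dim, 0
    for _ in range(depth):
        params["audio"].append(init_layer(rng, a_in, hidden_dim))
        params["visual"].append(init_layer(rng, v_in, hidden_dim))
        # Assumed wiring: the shared layer sees both modality layers at this
        # level and, from the second level on, the previous shared layer.
        params["shared"].append(
            init_layer(rng, 2 * hidden_dim + s_prev, shared_dim))
        a_in, v_in, s_prev = hidden_dim, hidden_dim, shared_dim
    return params


def dmf_forward(params, audio, visual):
    """Return the joint representation produced at every level."""
    h_a, h_v, shared = audio, visual, []
    joint = []
    layers = zip(params["audio"], params["visual"], params["shared"])
    for (wa, ba), (wv, bv), (ws, bs) in layers:
        h_a = tanh_layer(h_a, wa, ba)   # audio-specific layer
        h_v = tanh_layer(h_v, wv, bv)   # visual-specific layer
        # List concatenation fuses the two modalities with the earlier
        # shared layer, linking the joint representations across levels.
        shared = tanh_layer(h_a + h_v + shared, ws, bs)
        joint.append(shared)
    return joint


rng = random.Random(0)
params = build_dmf(rng, audio_dim=5, visual_dim=7,
                   hidden_dim=4, shared_dim=3, depth=3)
joint = dmf_forward(params, [0.1] * 5, [0.2] * 7)
print(len(joint), [len(level) for level in joint])
```

Each entry of `joint` is the shared representation at one depth, so a downstream task could read the deepest one or combine several levels; which of these the paper uses is not stated in the text above.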


Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis