Multimodal Language Analysis with Recurrent Multistage Fusion

Paul Pu Liang; Ziyin Liu; Amir Zadeh; Louis-Philippe Morency

arXiv:1808.03920·cs.LG·August 14, 2018

TL;DR

This paper introduces the Recurrent Multistage Fusion Network (RMFN), which models multimodal language by decomposing cross-modal fusion into multiple stages and combining that fusion with recurrent networks for temporal and intra-modal modeling. It reports state-of-the-art results on three multimodal datasets.

Contribution

The paper proposes a multistage fusion framework in which each stage focuses on a specialized subset of multimodal signals and builds on the intermediate representations of earlier stages.

Findings

1. Achieves state-of-the-art performance on three multimodal datasets.
2. Visualizations show each fusion stage focuses on different signal subsets.
3. Effectively models intra-modal and cross-modal interactions.

Abstract

Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but more importantly the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN) which decomposes the fusion problem into multiple stages, each of them focused on a subset of multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach which builds upon intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. The RMFN displays…
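The staged fusion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `multistage_fusion`, `rnn_step`, `W_highlight`, and `W_fuse`, the softmax weighting, the plain tanh recurrent update, and all dimensions are assumptions chosen for illustration, and the weights are random rather than trained. It shows only the general idea: per-modality recurrent states are concatenated at each time step, and several fusion stages each weight a subset of those signals while building on the previous stage's intermediate representation.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def random_matrix(rows, cols, rng):
    # Untrained placeholder weights (assumption: a real model would learn these).
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_step(W, x, h):
    # Plain tanh recurrent update, standing in for the per-modality recurrent networks.
    return [math.tanh(v) for v in matvec(W, x + h)]


def multistage_fusion(signals, W_highlight, W_fuse, num_stages):
    # signals: concatenated per-modality states at one time step.
    fused = [0.0] * len(W_fuse)  # intermediate representation, initially empty
    stage_weights = []
    for _ in range(num_stages):
        # Weight the signals, conditioned on what earlier stages already produced.
        attn = softmax(matvec(W_highlight, signals + fused))
        highlighted = [a * s for a, s in zip(attn, signals)]
        # Combine the weighted signals with the previous intermediate representation.
        fused = [math.tanh(v) for v in matvec(W_fuse, highlighted + fused)]
        stage_weights.append(attn)
    return fused, stage_weights


rng = random.Random(0)
modal_dims = {"language": 4, "visual": 3, "acoustic": 2}  # hypothetical feature sizes
hid, fuse_dim, num_stages = 3, 5, 3
rnn_W = {m: random_matrix(hid, d + hid, rng) for m, d in modal_dims.items()}
total = hid * len(modal_dims)
W_highlight = random_matrix(total, total + fuse_dim, rng)
W_fuse = random_matrix(fuse_dim, total + fuse_dim, rng)

hidden = {m: [0.0] * hid for m in modal_dims}
for t in range(4):  # four time steps of random stand-in input features
    for m, d in modal_dims.items():
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        hidden[m] = rnn_step(rnn_W[m], x, hidden[m])
    signals = [v for m in modal_dims for v in hidden[m]]
    fused, stage_weights = multistage_fusion(signals, W_highlight, W_fuse, num_stages)
```

After the loop, `fused` holds the cross-modal representation for the last time step and `stage_weights` holds one weight vector per stage, which is the kind of quantity one would inspect to see which signals each stage emphasizes. The sketch does not feed the fused representation back into the recurrent states; whether and how the paper does so is not stated in the abstract above.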

Code & Models

Repositories

righ120/multimodal_nlp

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Music and Audio Processing