Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion

Sijie Mai; Songlong Xing; Jiaxuan He; Ying Zeng; Haifeng Hu

arXiv:2011.13572·cs.AI·April 26, 2021·26 cites

TL;DR

This paper introduces a novel graph neural network approach for analyzing unaligned multimodal sequences, effectively capturing long-term dependencies and outperforming existing methods on benchmark datasets.

Contribution

The paper proposes a hierarchical multimodal graph model with graph convolution and pooling for unaligned sequence analysis, addressing limitations of RNNs and word-level fusion.

Findings

01. Achieves state-of-the-art results on benchmark datasets.

02. Effectively models intra- and inter-modal dynamics.

03. Handles unaligned multimodal sequences efficiently.

Abstract

In this paper, we study the task of multimodal sequence analysis, which aims to draw inferences from visual, language and acoustic sequences. Most existing works focus on aligned fusion of the three modalities, mostly at the word level, to accomplish this task, which is impractical in real-world scenarios. To overcome this issue, we seek to address the task of multimodal sequence analysis on unaligned modality sequences, which is still relatively underexplored and also more challenging. Recurrent neural networks (RNNs) and their variants are widely used in multimodal sequence analysis, but they are susceptible to the issues of gradient vanishing/explosion and high time complexity due to their recurrent nature. Therefore, we propose a novel model, termed Multimodal Graph, to investigate the effectiveness of graph neural networks (GNN) on modeling multimodal sequential data. The…
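
To make the idea of graph convolution followed by graph pooling on unaligned sequences more concrete, here is a minimal NumPy sketch. It is not the authors' Multimodal Graph model: the fully connected adjacency, the mean-aggregation convolution, the random weights, the max pooling, and all names (`normalized_adjacency`, `graph_conv`, `multimodal_graph_sketch`, `hidden_dim`) are assumptions chosen for illustration. It only shows how three sequences of different lengths can be fused without word-level alignment, by treating each time step as a graph node.

```python
import numpy as np


def normalized_adjacency(adj):
    """Add self-loops and row-normalize, so each node averages over itself and its neighbors."""
    adj = adj + np.eye(adj.shape[0])
    return adj / adj.sum(axis=1, keepdims=True)


def graph_conv(nodes, adj, weight):
    """One mean-aggregation graph convolution layer followed by ReLU."""
    return np.maximum(normalized_adjacency(adj) @ nodes @ weight, 0.0)


def multimodal_graph_sketch(sequences, hidden_dim=8, seed=0):
    """Fuse sequences of different lengths and feature sizes into one vector.

    Hypothetical two-stage layout (an assumption, not the paper's architecture):
    stage 1 runs a graph convolution inside each modality, stage 2 runs one
    over all nodes of all modalities, then max pooling summarizes the graph.
    """
    rng = np.random.default_rng(seed)

    # Stage 1: intra-modal graphs. Every time step is a node, and nodes within
    # one modality are fully connected (assumed), so no recurrence is needed.
    hidden = []
    for seq in sequences:
        length, dim = seq.shape
        adj = np.ones((length, length)) - np.eye(length)
        weight = rng.standard_normal((dim, hidden_dim)) * 0.1  # untrained weights
        hidden.append(graph_conv(seq, adj, weight))

    # Stage 2: inter-modal graph. Nodes from all modalities are stacked, so
    # sequence lengths never have to match.
    nodes = np.concatenate(hidden, axis=0)
    count = nodes.shape[0]
    adj = np.ones((count, count)) - np.eye(count)
    weight = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
    fused = graph_conv(nodes, adj, weight)

    # Graph pooling: max over nodes gives a fixed-size representation.
    return fused.max(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    language = rng.standard_normal((5, 4))   # 5 steps, 4 features
    acoustic = rng.standard_normal((7, 6))   # 7 steps, 6 features
    visual = rng.standard_normal((3, 2))     # 3 steps, 2 features
    print(multimodal_graph_sketch([language, acoustic, visual]).shape)
```

Because every node aggregates from its neighbors in a single matrix product, the sketch avoids step-by-step recurrence, which is the property the abstract contrasts with RNNs. A trained model would learn the weights and, most likely, the adjacency as well.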

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization