Factorized Multimodal Transformer for Multimodal Sequential Learning

Amir Zadeh; Chengfeng Mao; Kelly Shi; Yiwei Zhang; Paul Pu Liang; Soujanya Poria; Louis-Philippe Morency

arXiv:1911.09826·cs.LG·November 25, 2019·37 cites

Open Access

TL;DR

The paper introduces the Factorized Multimodal Transformer (FMT), a transformer model for multimodal sequential learning that models intramodal and intermodal dynamics in a factorized manner and reports state-of-the-art results on the datasets studied.

Contribution

The paper proposes a new transformer architecture, FMT, that factorizes multimodal dynamics into intramodal and intermodal components. The factorization allows the number of self-attentions to be increased to better model the multimodal input.

Findings

01. FMT outperforms previous models on multiple datasets

02. Achieves new state-of-the-art performance in multimodal tasks

03. Effectively models long-range multimodal dependencies

Abstract

The complex world around us is inherently multimodal and sequential (continuous). Information is scattered across different modalities and requires multiple continuous sensors to be captured. As machine learning leaps towards better generalization to the real world, multimodal sequential learning becomes a fundamental research area. Arguably, modeling arbitrarily distributed spatio-temporal dynamics within and across modalities is the biggest challenge in this research area. In this paper, we present a new transformer model, called the Factorized Multimodal Transformer (FMT), for multimodal sequential learning. FMT inherently models the intramodal and intermodal (involving two or more modalities) dynamics within its multimodal input in a factorized manner. The proposed factorization allows for increasing the number of self-attentions to better model the multimodal phenomena at hand; without…
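To make the idea of factorization concrete, the sketch below enumerates one factor per non-empty subset of modalities: single-modality subsets correspond to intramodal dynamics, and subsets of two or more to intermodal dynamics. This is only an illustration under stated assumptions. The abstract excerpt does not say how factors are enumerated, and the function name `modality_factors` and the modality names used here are hypothetical, not taken from the paper's code.

```python
from itertools import combinations


def modality_factors(modalities):
    """Return every non-empty subset of the given modalities, smallest first.

    Assumption for illustration: one attention factor per subset.
    """
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


# Hypothetical modality names, chosen only for this example.
factors = modality_factors(["language", "vision", "acoustic"])

for subset in factors:
    kind = "intramodal" if len(subset) == 1 else "intermodal"
    print(kind, subset)
```

With three modalities this produces seven subsets (three unimodal, three bimodal, one trimodal), which shows how the number of attention factors grows with the number of modalities under this assumption.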

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax