Factorized Multimodal Transformer for Multimodal Sequential Learning
Amir Zadeh, Chengfeng Mao, Kelly Shi, Yiwei Zhang, Paul Pu Liang, Soujanya Poria, Louis-Philippe Morency

TL;DR
The paper introduces the Factorized Multimodal Transformer (FMT), a transformer for multimodal sequential learning that models intramodal and intermodal dynamics in a factorized manner, using a larger number of self-attentions, and reports state-of-the-art results.
Contribution
The paper proposes a new transformer architecture, FMT, that factorizes multimodal dynamics, enabling better modeling of complex data without overfitting, even with limited resources.
Findings
FMT outperforms previous models on multiple datasets
Achieves new state-of-the-art performance in multimodal tasks
Effectively models long-range multimodal dependencies
Abstract
The complex world around us is inherently multimodal and sequential (continuous). Information is scattered across different modalities and requires multiple continuous sensors to be captured. As machine learning leaps towards better generalization to real world, multimodal sequential learning becomes a fundamental research area. Arguably, modeling arbitrarily distributed spatio-temporal dynamics within and across modalities is the biggest challenge in this research area. In this paper, we present a new transformer model, called the Factorized Multimodal Transformer (FMT) for multimodal sequential learning. FMT inherently models the intramodal and intermodal (involving two or more modalities) dynamics within its multimodal input in a factorized manner. The proposed factorization allows for increasing the number of self-attentions to better model the multimodal phenomena at hand; without…
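As a rough illustration of what a factorized treatment of intramodal and intermodal dynamics could look like, the sketch below enumerates every non-empty subset of modalities and runs one self-attention per subset. Single-modality subsets stand in for intramodal dynamics, and subsets of two or more modalities stand in for intermodal dynamics. Everything here is an assumption made for illustration: the three modality names, the toy feature sizes, the feature-wise concatenation, and the weight-free attention are not taken from the paper, and the helper names `modality_factors`, `self_attention`, and `factorized_attention` are hypothetical.

```python
from itertools import combinations

import numpy as np


def modality_factors(modalities):
    """Return every non-empty subset of modalities, smallest subsets first."""
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


def self_attention(x):
    """Scaled dot-product self-attention with no learned weights (illustrative only)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def factorized_attention(inputs):
    """Run one self-attention per factor on the concatenated features of its modalities."""
    outputs = {}
    for subset in modality_factors(sorted(inputs)):
        x = np.concatenate([inputs[name] for name in subset], axis=-1)
        outputs[subset] = self_attention(x)
    return outputs


rng = np.random.default_rng(0)
# Hypothetical toy inputs: 5 aligned time steps, arbitrary feature sizes.
inputs = {
    "language": rng.normal(size=(5, 8)),
    "vision": rng.normal(size=(5, 4)),
    "acoustic": rng.normal(size=(5, 6)),
}
outputs = factorized_attention(inputs)
for subset, value in outputs.items():
    print(subset, value.shape)
```

With three modalities this yields seven factors: three unimodal, three bimodal, and one trimodal. The count grows as 2^n - 1 with the number of modalities n, which is why the number of self-attentions increases under this kind of factorization.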
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
