Low Rank Fusion based Transformers for Multimodal Sequences
Saurav Sahay, Eda Okur, Shachi H Kumar, Lama Nachman

TL;DR
This paper introduces a low-rank multimodal fusion transformer architecture that efficiently models interactions between sensory signals for emotion recognition, achieving comparable performance with fewer parameters and faster training.
Contribution
It proposes a novel low-rank fusion approach within transformer models for multimodal emotion recognition, reducing model complexity and training time.
Findings
Fewer parameters than existing models
Faster training times
Comparable accuracy on emotion recognition datasets
Abstract
Our senses individually work in a coordinated fashion to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals to attend to our latent multimodal emotional intentions and vice versa expressed via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion amongst the modalities helps represent approximate multiplicative latent signal interactions. Motivated by the work of Tsai et al. (2019) and Liu et al. (2018), we present our transformer-based cross-fusion architecture without any over-parameterization of the model. The low-rank fusion helps represent the latent signal interactions while the modality-specific attention helps focus on relevant parts of the signal. We present two methods for the Multimodal Sentiment and Emotion Recognition results on CMU-MOSEI, CMU-MOSI, and IEMOCAP…
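The low-rank factorization mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, shapes, and the two-modality reference check are assumptions made for illustration. The idea shown is that each modality vector (with a constant 1 appended) is projected by rank-many modality-specific factors, the projections are multiplied elementwise across modalities, and the result is summed over the rank. This gives the same output as contracting the full outer-product tensor with a weight tensor built from those factors, without ever materializing that tensor.

```python
import numpy as np


def low_rank_fusion(inputs, factors):
    """Fuse modality vectors with rank-decomposed weights (illustrative sketch).

    inputs:  list of 1-D arrays, one per modality, with sizes d_m.
    factors: list of arrays with shape (rank, d_m + 1, out_dim),
             one per modality (hypothetical layout chosen for this example).
    """
    fused = None
    for z, w in zip(inputs, factors):
        z1 = np.append(z, 1.0)                # append 1 so unimodal terms survive
        proj = np.einsum("d,rdo->ro", z1, w)  # shape (rank, out_dim)
        # Elementwise product across modalities models multiplicative interactions.
        fused = proj if fused is None else fused * proj
    return fused.sum(axis=0)                  # sum over the rank dimension


def full_tensor_fusion_two_modalities(inputs, factors):
    """Reference for two modalities: build the full weight tensor explicitly."""
    za, zb = (np.append(z, 1.0) for z in inputs)
    wa, wb = factors
    weight = np.einsum("rao,rbo->abo", wa, wb)  # shape (d_a + 1, d_b + 1, out_dim)
    outer = np.outer(za, zb)                    # explicit outer-product tensor
    return np.einsum("ab,abo->o", outer, weight)
```

For two modalities of sizes 3 and 5, rank 4, and output size 2, the full weight tensor in this sketch holds 4 × 6 × 2 = 48 entries, while the factors hold 4 × (4 + 6) × 2 = 80; the saving appears as dimensions and modality count grow, because factor size scales with the sum of the modality dimensions rather than their product.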
Taxonomy
Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Multimodal Machine Learning Applications
