An Efficient End-to-End Transformer with Progressive Tri-modal Attention for Multi-modal Emotion Recognition
Yang Wu, Pai Peng, Zhenyu Zhang, Yanyan Zhao, Bing Qin

TL;DR
This paper introduces ME2ET, an end-to-end transformer model that effectively captures tri-modal feature interactions for emotion recognition, improving performance and efficiency on standard datasets.
Contribution
The paper proposes a novel progressive tri-modal attention mechanism and a tri-modal feature fusion layer for enhanced multi-modal emotion recognition.
Findings
Achieves state-of-the-art results on CMU-MOSEI and IEMOCAP datasets.
Reduces computational and memory costs through a two-pass attention strategy.
Demonstrates improved interpretability of multi-modal interactions.
Abstract
Recent work on multi-modal emotion recognition has moved towards end-to-end models, which, unlike two-phase pipelines, can extract task-specific features supervised by the target task. However, previous methods only model the feature interactions between the textual modality and either the acoustic or the visual modality, and ignore the feature interactions between the acoustic and visual modalities. In this paper, we propose the multi-modal end-to-end transformer (ME2ET), which can effectively model the tri-modal feature interactions among the textual, acoustic, and visual modalities at both the low level and the high level. At the low level, we propose the progressive tri-modal attention, which can model the tri-modal feature interactions by adopting a two-pass strategy and can further leverage such interactions to significantly reduce the computation and memory complexity through reducing…
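The abstract is cut off before it says how the two-pass strategy lowers cost, so the sketch below does not attempt to reproduce the paper's layer. It only illustrates a general property of cross-attention that a two-pass design could exploit: the output has one row per query, so a long sequence attended to by a short query sequence comes out short. The order of the passes, the sequence lengths, the absence of learned projections, and the helper names `cross_attention` and `random_tokens` are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention, no learned projections.

    queries:     n_q vectors of dimension d
    keys_values: n_kv vectors of dimension d (used as both keys and values)
    Returns n_q vectors of dimension d.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return out


def random_tokens(n, d, rng):
    """Hypothetical stand-in for encoder outputs: n random d-dimensional tokens."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
d = 8
text = random_tokens(5, d, rng)        # short sequence (assumed length)
visual = random_tokens(200, d, rng)    # long sequence (assumed length)
acoustic = random_tokens(120, d, rng)  # long sequence (assumed length)

# Pass 1 (assumed arrangement): text queries summarize the visual tokens.
visual_summary = cross_attention(text, visual)
# Pass 2 (assumed arrangement): the summary queries the acoustic tokens.
tri_modal = cross_attention(visual_summary, acoustic)

print(len(visual), "->", len(visual_summary), "->", len(tri_modal))
# prints: 200 -> 5 -> 5
```

With these assumed sizes, pass 1 computes 5 × 200 = 1,000 attention scores, whereas self-attention over the 200 visual tokens alone would compute 200 × 200 = 40,000.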
Taxonomy
Topics: Emotion and Mood Recognition · Advanced Computing and Algorithms · Sentiment Analysis and Opinion Mining
