MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition
Peihao Xiang, Chaohao Lin, Kaida Wu, Ou Bai

TL;DR
This paper introduces MultiMAE-DER, a multimodal masked autoencoder that improves dynamic emotion recognition by leveraging cross-modal correlations and optimizing fusion strategies, achieving state-of-the-art results on multiple datasets.
Contribution
The paper proposes a novel multimodal masked autoencoder approach with optimized fusion strategies for dynamic emotion recognition, outperforming existing models in supervised and self-supervised settings.
Findings
WAR improved by 4.41% on RAVDESS
WAR improved by 2.06% on CREMA-D
WAR improved by 1.86% on IEMOCAP
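WAR in the findings above stands for weighted average recall: per-class recall averaged with weights proportional to each class's share of the samples, which works out to overall accuracy. The sketch below is a minimal illustration of that metric next to its unweighted counterpart (UAR). It is not the authors' evaluation code; the function names and the toy emotion labels are assumptions made for the example.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}


def weighted_average_recall(y_true, y_pred):
    """WAR: per-class recall weighted by each class's share of the samples."""
    support = Counter(y_true)
    recall = per_class_recall(y_true, y_pred)
    total = len(y_true)
    return sum(recall[label] * support[label] / total for label in support)


def unweighted_average_recall(y_true, y_pred):
    """UAR: plain mean of per-class recall, ignoring class sizes."""
    recall = per_class_recall(y_true, y_pred)
    return sum(recall.values()) / len(recall)


# Toy labels (hypothetical, not taken from any of the datasets above).
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "sad", "sad"]

war = weighted_average_recall(y_true, y_pred)    # (2/3)*(3/4) + 1*(1/4) = 0.75
uar = unweighted_average_recall(y_true, y_pred)  # (2/3 + 1) / 2
```

Because WAR weights classes by frequency, a model can score well on it while doing poorly on rare emotions, which is why the two metrics are often reported together on imbalanced emotion datasets.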
Abstract
This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across visual and audio modalities. Built on a pre-trained masked autoencoder model, MultiMAE-DER is obtained through simple, straightforward fine-tuning. Its performance is enhanced by optimizing six fusion strategies for multimodal input sequences. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. In comparison to state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER enhances the weighted average recall (WAR) by 4.41% on the…
Taxonomy
Topics: Emotion and Mood Recognition
Methods: Softmax · Linear Layer · Self-Learning · Denoising Autoencoder · Layer Normalization · Residual Connection · Attention Is All You Need · Dense Connections · Multi-Head Attention · Vision Transformer
