MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition

Peihao Xiang; Chaohao Lin; Kaida Wu; Ou Bai

arXiv:2404.18327 · cs.CV · October 17, 2024

TL;DR

This paper introduces MultiMAE-DER, a multimodal masked autoencoder that improves dynamic emotion recognition by exploiting cross-modal correlations between visual and audio sequences and by optimizing six fusion strategies, raising WAR over prior state-of-the-art models on RAVDESS, CREMA-D, and IEMOCAP.

Contribution

The paper proposes a multimodal masked autoencoder, obtained by fine-tuning a pre-trained masked autoencoder, with optimized fusion strategies for dynamic emotion recognition; it outperforms existing multimodal supervised and self-supervised models.

Findings

01 WAR improved by 4.41% on RAVDESS

02 WAR improved by 2.06% on CREMA-D

03 WAR improved by 1.86% on IEMOCAP
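
The findings above are reported as weighted average recall (WAR). As a minimal sketch of the metric, assuming the common definition in which each class's recall is weighted by that class's share of the samples (which makes WAR equal to overall accuracy), the Python below computes WAR alongside the unweighted average recall (UAR). The function names and the toy labels are illustrative only; they do not come from the paper or its repository.

```python
from collections import Counter


def recall_per_class(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}


def weighted_average_recall(y_true, y_pred):
    """WAR: per-class recall weighted by each class's share of the samples."""
    support = Counter(y_true)
    total = len(y_true)
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls[label] * support[label] / total for label in support)


def unweighted_average_recall(y_true, y_pred):
    """UAR: plain mean of per-class recall, ignoring class sizes."""
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy, imbalanced label set (hypothetical data, not from any dataset).
y_true = ["happy"] * 6 + ["sad"] * 3 + ["angry"] * 1
y_pred = (["happy"] * 5 + ["sad"]      # 5 of 6 "happy" clips correct
          + ["sad"] * 2 + ["happy"]    # 2 of 3 "sad" clips correct
          + ["sad"])                   # 0 of 1 "angry" clips correct

war = weighted_average_recall(y_true, y_pred)
uar = unweighted_average_recall(y_true, y_pred)
print(f"WAR={war:.3f}  UAR={uar:.3f}")
```

On imbalanced data the two metrics diverge: WAR is dominated by the frequent classes, while UAR penalizes a model that misses a rare emotion entirely.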

Abstract

This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). The MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across visual and audio modalities. By utilizing a pre-trained masked autoencoder model, the MultiMAE-DER is accomplished through simple, straightforward fine-tuning. The performance of the MultiMAE-DER is enhanced by optimizing six fusion strategies for multimodal input sequences. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. In comparison to state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER enhances the weighted average recall (WAR) by 4.41% on the…

Code & Models

Repositories

Peihao-Xiang/MultiMAE-DFER · tf · Official

Taxonomy

Topics: Emotion and Mood Recognition

Methods: Softmax · Linear Layer · Self-Learning · Denoising Autoencoder · Layer Normalization · Residual Connection · Attention Is All You Need · Dense Connections · Multi-Head Attention · Vision Transformer