A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition

Ziwang Fu; Feng Liu; Hanyang Wang; Jiayin Qi; Xiangling Fu; Aimin Zhou; Zhibin Li

arXiv:2111.02172·cs.CV·November 4, 2021·28 cites

TL;DR

This paper introduces a novel cross-modal fusion network using self-attention and residual structures for multimodal emotion recognition, effectively preserving semantic information and enhancing performance on the RAVDESS dataset.

Contribution

The paper proposes a new fusion network that combines self-attention and residual structures to improve multimodal emotion recognition by maintaining semantic integrity.

Findings

01. Achieves 75.76% accuracy on the RAVDESS dataset

02. Outperforms existing methods in multimodal emotion recognition

03. The model has 26.30 million parameters

Abstract

Audio-video based multimodal emotion recognition has attracted a lot of attention due to its robust performance. Most existing methods focus on proposing different cross-modal fusion strategies. However, these strategies introduce redundancy in the features of different modalities without fully considering the complementary properties between modal information, and they do not guarantee that the original semantic information is preserved during intra- and inter-modal interactions. In this paper, we propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition. Firstly, we perform representation learning for the video and audio modalities to obtain the semantic features of the two modalities by an efficient ResNeXt and a 1D CNN, respectively. Secondly, we feed the features of the two modalities into the…
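The abstract is cut off before it describes the fusion module, so the following is only a generic sketch of the idea named in the title: one modality attends to the other through scaled dot-product attention, and a residual connection adds the original features back so they are not lost. It is not the authors' CFN-SR implementation. The function name `cross_attention_residual`, the choice of audio as the query, and the use of identity projections in place of learned query/key/value weights are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_residual(query_feats, context_feats):
    """Hypothetical cross-modal attention with a residual connection.

    query_feats:   list of T_q feature vectors (each of length d)
    context_feats: list of T_c feature vectors (each of length d)
    Assumption: identity projections stand in for learned Q/K/V weights.
    Returns a list of T_q vectors of length d.
    """
    d = len(query_feats[0])
    scale = math.sqrt(d)
    fused = []
    for q in query_feats:
        # Similarity of this query step to every context step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in context_feats]
        weights = softmax(scores)
        # Attention-weighted sum of the context features.
        attended = [sum(w * v[j] for w, v in zip(weights, context_feats))
                    for j in range(d)]
        # Residual connection: keep the original query features.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# Toy inputs (made up): two audio steps, one video step, feature size 2.
audio_feats = [[1.0, 0.0], [0.0, 1.0]]
video_feats = [[0.5, 0.5]]
fused = cross_attention_residual(audio_feats, video_feats)
print(fused)  # [[1.5, 0.5], [0.5, 1.5]]
```

Because of the residual term, an all-zero context leaves the query features unchanged, which is the sense in which this kind of structure can preserve the original information of a modality.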

Code & Models

Repositories

skeletonnn/cfn-sr (PyTorch, official)

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Music and Audio Processing

Methods: 1x1 Convolution · ResNeXt Block · Convolution · 1-Dimensional Convolutional Neural Networks · Grouped Convolution · Kaiming Initialization · Average Pooling · Global Average Pooling · Batch Normalization