A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition
Ziwang Fu, Feng Liu, Hanyang Wang, Jiayin Qi, Xiangling Fu, Aimin Zhou, Zhibin Li

TL;DR
This paper introduces a novel cross-modal fusion network using self-attention and residual structures for multimodal emotion recognition, effectively preserving semantic information and enhancing performance on the RAVDESS dataset.
Contribution
The paper proposes a new fusion network that combines self-attention and residual structures to improve multimodal emotion recognition by maintaining semantic integrity.
Findings
Achieves 75.76% accuracy on the RAVDESS dataset
Outperforms existing methods in multimodal emotion recognition
The model has 26.30 million parameters
Abstract
The audio-video based multimodal emotion recognition has attracted a lot of attention due to its robust performance. Most of the existing methods focus on proposing different cross-modal fusion strategies. However, these strategies introduce redundancy in the features of different modalities without fully considering the complementary properties between modal information, and these approaches do not guarantee the non-loss of original semantic information during intra- and inter-modal interactions. In this paper, we propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition. Firstly, we perform representation learning for audio and video modalities to obtain the semantic features of the two modalities by efficient ResNeXt and 1D CNN, respectively. Secondly, we feed the features of the two modalities into the…
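The abstract is cut off before the fusion block is described, so the following is only a generic illustration of the two ideas named in the title, not the authors' CFN-SR module. It is a minimal sketch, assuming scaled dot-product self-attention with no learned projections, followed by a residual connection that adds the attended output back to the input so the original features are preserved. The function names and the toy feature values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_residual(x):
    """Return x + Attention(x) for a sequence of feature vectors.

    x is a list of T vectors, each a list of d floats.
    Assumption: queries, keys, and values are all x itself (no learned
    weight matrices), which keeps the sketch short.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    out = []
    for q in x:
        # Similarity of this position to every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x]
        weights = softmax(scores)
        # Weighted average of all positions (the attended features).
        attended = [sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)]
        # Residual connection: the input is added back unchanged.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


# Toy sequence of three 2-dimensional feature vectors (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = self_attention_residual(features)
```

Because the attended term is a convex combination of the input vectors, the output at each position equals the original feature plus a context vector, which is one simple way to read the abstract's concern about not losing the original semantic information.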
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Music and Audio Processing
Methods: 1x1 Convolution · ResNeXt Block · Convolution · 1-Dimensional Convolutional Neural Networks · Grouped Convolution · Kaiming Initialization · Average Pooling · Global Average Pooling · Batch Normalization
