Multi-modal Speech Emotion Recognition via Feature Distribution Adaptation Network

Shaokai Li; Yixuan Ji; Peng Song; Haoqin Sun; Wenming Zheng

arXiv:2410.22023 · cs.CV · November 5, 2024

TL;DR

This paper introduces a deep transfer learning framework that aligns visual and audio features for improved multi-modal speech emotion recognition, utilizing feature distribution adaptation and cross-attention mechanisms.

Contribution

It presents a novel feature distribution adaptation network that effectively combines visual and audio modalities for emotion recognition using deep transfer learning.

Findings

01 Achieves superior performance on benchmark datasets

02 Effectively aligns multi-modal feature distributions

03 Utilizes cross-attention for intrinsic similarity modeling
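
The cross-attention idea in the findings can be illustrated with a minimal NumPy sketch of single-head scaled dot-product cross-attention, where tokens from one modality query tokens from the other. This is only an illustration under stated assumptions, not the authors' implementation: the token counts, feature sizes, random projection matrices, and the choice of visual features as queries are all hypothetical, and the paper may use a different attention layout.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    query_feats:   (n_q, d_model) tokens from one modality
    context_feats: (n_c, d_model) tokens from the other modality
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    Returns the attended features (n_q, d_head) and the
    attention weights (n_q, n_c).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical shapes: 5 visual tokens and 7 audio tokens, 8-dim features.
rng = np.random.default_rng(0)
d_model, d_head = 8, 4
visual = rng.normal(size=(5, d_model))
audio = rng.normal(size=(7, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

# Visual tokens attend over audio tokens (an assumed direction).
attended, weights = cross_attention(visual, audio, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the audio tokens, so a visual token's output is a similarity-weighted mix of audio values. Swapping the first two arguments gives the audio-to-visual direction.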

Abstract

In this paper, we propose a novel deep inductive transfer learning framework, named feature distribution adaptation network, to tackle the challenging multi-modal speech emotion recognition problem. Our method uses deep transfer learning strategies to align visual and audio feature distributions and obtain a consistent representation of emotion, thereby improving the performance of speech emotion recognition. In our model, a pre-trained ResNet-34 is used to extract features from facial expression images and acoustic Mel spectrograms, respectively. Then, a cross-attention mechanism is introduced to model the intrinsic similarity relationships of the multi-modal features. Finally, multi-modal feature distribution adaptation is performed efficiently with a feed-forward network, which is extended using the local maximum mean discrepancy loss. Experiments are carried out on two…
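
The local maximum mean discrepancy (LMMD) loss named in the abstract can be sketched as a class-weighted MMD: for each emotion class, it compares the kernel mean embeddings of the two feature sets, weighting every sample by its normalized class membership. The NumPy sketch below is an assumption-laden illustration rather than the paper's code: the single Gaussian kernel, the fixed `bandwidth`, the function names, and the use of hard one-hot labels on both sides are all choices made here for clarity.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def class_weights(probs):
    """Normalize each class column so it sums to 1 over the batch."""
    col_sum = probs.sum(axis=0, keepdims=True)
    return probs / np.where(col_sum == 0, 1.0, col_sum)


def lmmd(feat_a, probs_a, feat_b, probs_b, bandwidth=1.0):
    """Class-weighted (local) MMD between two feature sets.

    feat_*:  (n, d) feature matrices, e.g. visual and audio features
    probs_*: (n, C) class memberships (one-hot labels or soft predictions)
    """
    w_a, w_b = class_weights(probs_a), class_weights(probs_b)
    k_aa = gaussian_kernel(feat_a, feat_a, bandwidth)
    k_bb = gaussian_kernel(feat_b, feat_b, bandwidth)
    k_ab = gaussian_kernel(feat_a, feat_b, bandwidth)
    n_classes = probs_a.shape[1]
    total = 0.0
    for c in range(n_classes):
        wa, wb = w_a[:, c], w_b[:, c]
        # Squared distance between the class-c weighted kernel means.
        total += wa @ k_aa @ wa + wb @ k_bb @ wb - 2.0 * wa @ k_ab @ wb
    return total / n_classes


# Hypothetical toy data: 6 samples, 3-dim features, 2 emotion classes.
rng = np.random.default_rng(0)
labels = np.eye(2)[[0, 1, 0, 1, 0, 1]]
feats = rng.normal(size=(6, 3))

loss_same = lmmd(feats, labels, feats, labels)           # identical sets
loss_shifted = lmmd(feats, labels, feats + 3.0, labels)  # shifted copy
```

Identical feature sets give a loss of zero up to floating-point error, while the shifted copy gives a strictly larger value, which is the behavior an alignment loss needs. In a deep model the same quantity would be computed on network activations and minimized jointly with the classification loss.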

Code & Models

Repositories

shaokai1209/fdan (PyTorch, official)

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: ALIGN