MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition

Haiyang Sun; Fulin Zhang; Yingying Gao; Zheng Lian; Shilei Zhang; Junlan Feng

arXiv:2306.09361 · eess.AS · June 27, 2024 · 2 cites

Open Access

TL;DR

This paper introduces MFSN, a framework for exploiting pre-trained knowledge in Speech Emotion Recognition. It fuses emotional cues from the semantic and acoustic perspectives and reports improved recognition accuracy.

Contribution

The paper proposes a new architecture search space and a dual-perspective approach to comprehensively and appropriately capture emotional cues in speech.

Findings

01 MFSN outperforms existing methods on multiple datasets.

02 The dual-perspective approach improves emotion recognition accuracy.

03 The architecture search space effectively leverages semantic and acoustic cues.

Abstract

Speech Emotion Recognition (SER) is an important research topic in human-computer interaction. Many recent works focus on directly extracting emotional cues through pre-trained knowledge, frequently overlooking considerations of appropriateness and comprehensiveness. Therefore, we propose a novel framework for pre-training knowledge in SER, called Multi-perspective Fusion Search Network (MFSN). Considering comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC), capturing cues from both semantic and acoustic perspectives, and we design a new architecture search space to fully leverage them. Considering appropriateness, we verify the efficacy of different modeling approaches in capturing SEC and fill a gap in current research. Experimental results on multiple datasets demonstrate the superiority of MFSN.
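
The abstract does not spell out the search space, so the following is only a rough sketch of the general idea of searching over ways to fuse a TEC feature vector with an SEC feature vector. It borrows the continuous relaxation used in differentiable architecture search (a softmax-weighted mixture of candidate operations), which may differ from what MFSN actually does. The candidate operations, the names `CANDIDATE_OPS`, `mixed_fusion`, and `select_op`, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate fusion operations (not taken from the paper).
# Each combines two equal-length feature vectors element-wise.
def fuse_add(tec, sec):
    return [t + s for t, s in zip(tec, sec)]


def fuse_mul(tec, sec):
    return [t * s for t, s in zip(tec, sec)]


def fuse_max(tec, sec):
    return [max(t, s) for t, s in zip(tec, sec)]


CANDIDATE_OPS = {"add": fuse_add, "mul": fuse_mul, "max": fuse_max}


def mixed_fusion(tec, sec, arch_scores):
    """Softmax-weighted mixture of every candidate op's output.

    arch_scores maps op name -> architecture score; in a real search
    these scores would be learned, here they are supplied by the caller.
    """
    names = list(CANDIDATE_OPS)
    weights = softmax([arch_scores[n] for n in names])
    outputs = [CANDIDATE_OPS[n](tec, sec) for n in names]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(tec))
    ]


def select_op(arch_scores):
    """After the search, keep only the highest-scoring operation."""
    return max(arch_scores, key=arch_scores.get)


# Toy features standing in for TEC (semantic) and SEC (acoustic) embeddings.
tec = [1.0, 2.0]
sec = [3.0, 0.5]
uniform = {"add": 0.0, "mul": 0.0, "max": 0.0}
mixed = mixed_fusion(tec, sec, uniform)
chosen = select_op({"add": 0.1, "mul": 2.0, "max": -1.0})
```

With equal scores, every candidate contributes one third of the fused output. Once the scores diverge, `select_op` reduces the mixture to a single discrete choice, which is the usual final step in this style of search.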

Taxonomy

Topics: Emotion and Mood Recognition · Speech and dialogue systems · Speech Recognition and Synthesis