Frame-level emotional state alignment method for speech emotion recognition
Qifei Li, Yingming Gao, Cong Wang, Yayue Deng, Jinlong Xue, Yichen Han, Ya Li

TL;DR
This paper introduces a novel frame-level emotional state alignment method for speech emotion recognition, improving accuracy by focusing on emotionally consistent frames using a fine-tuned HuBERT model with pseudo-labels.
Contribution
The paper proposes a new frame-level alignment approach with task-adaptive pretraining and clustering, enhancing speech emotion recognition performance over existing methods.
Findings
Outperforms state-of-the-art methods on the IEMOCAP dataset
Effective use of pseudo-labels for frame-level emotion detection
Improved focus on emotionally relevant frames
Abstract
Speech emotion recognition (SER) systems aim to recognize human emotional states during human-computer interaction. Most existing SER systems are trained on utterance-level labels. However, not all frames in an audio clip have affective states consistent with the utterance-level label, which makes it difficult for the model to distinguish the true emotion of the audio and causes it to perform poorly. To address this problem, we propose a frame-level emotional state alignment method for SER. First, we fine-tune a HuBERT model with the task-adaptive pretraining (TAPT) method to obtain an SER system, and extract embeddings from its transformer layers to form frame-level pseudo-emotion labels with clustering. Then, the pseudo labels are used to pretrain HuBERT, so that each frame output of HuBERT carries corresponding emotional information. Finally, we fine-tune the above pretrained HuBERT for SER by adding an…
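The pseudo-labeling step described in the abstract (clustering frame embeddings so that each frame receives a discrete label) can be illustrated with a minimal sketch. The abstract excerpt does not name the clustering algorithm, so plain k-means is an assumption here, and the function name `kmeans_pseudo_labels`, its parameters, and the toy 2-D vectors are hypothetical stand-ins. In the paper's pipeline the inputs would be embeddings taken from the transformer layers of the fine-tuned HuBERT model.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pseudo_labels(frames, num_clusters, num_iters=20, seed=0):
    """Cluster frame embeddings; each frame's cluster index is its pseudo label.

    Assumption: plain k-means is used only as an illustrative clustering method.
    """
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    labels = [0] * len(frames)
    for _ in range(num_iters):
        # Assignment step: each frame takes the index of its nearest centroid.
        labels = [
            min(range(num_clusters), key=lambda k: squared_distance(f, centroids[k]))
            for f in frames
        ]
        # Update step: move each centroid to the mean of its assigned frames.
        for k in range(num_clusters):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy stand-in for frame embeddings: two well-separated groups of 10 frames each.
rng = random.Random(1)
toy_frames = (
    [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(10)]
    + [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(10)]
)
pseudo_labels, _ = kmeans_pseudo_labels(toy_frames, num_clusters=2)
print(pseudo_labels)
```

On this toy input, the first ten frames share one cluster index and the last ten share the other. Label sequences of this kind are what the abstract describes as frame-level targets for the subsequent HuBERT pretraining stage.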
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · EEG and Brain-Computer Interfaces
