Enhancing Segment-Based Speech Emotion Recognition by Deep Self-Learning
Shuiyang Mao, P. C. Ching, and Tan Lee

TL;DR
This paper introduces a deep self-learning framework for segment-based speech emotion recognition that iteratively refines noisy segment labels, significantly improving model performance on emotional speech datasets.
Contribution
It proposes a novel deep self-learning approach that dynamically corrects segment labels, addressing label noise issues in segment-based speech emotion recognition.
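This summary does not spell out how the labels are corrected, so the following is only a minimal sketch of one generic self-learning step and not the authors' procedure. It assumes a trained model has already produced per-segment class probabilities; the function name `refine_labels` and the `threshold` parameter are hypothetical.

```python
from typing import List, Sequence


def refine_labels(
    current_labels: List[int],
    probs: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed confidence cut-off, not from the paper
) -> List[int]:
    """Replace a segment's label with the model's prediction when the
    model is confident enough; otherwise keep the current label."""
    refined = []
    for label, p in zip(current_labels, probs):
        # Index of the most probable class for this segment.
        best = max(range(len(p)), key=lambda k: p[k])
        refined.append(best if p[best] >= threshold else label)
    return refined


# Three segments that all inherited class 0 from their utterance.
labels = [0, 0, 0]
probs = [[0.95, 0.05], [0.03, 0.97], [0.60, 0.40]]
new_labels = refine_labels(labels, probs)
```

In an iterative setup, the refined labels would be used to retrain the model, and the step would be repeated for several rounds.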
Findings
Significant performance improvements on three emotional corpora.
Effective label correction reduces noise impact.
Enhanced robustness of emotion recognition models.
Abstract
Despite the widespread use of deep neural networks (DNNs) for speech emotion recognition (SER), they are severely restricted by the paucity of labeled training data. Recently, segment-based approaches for SER have been evolving; these train backbone networks on shorter segments instead of whole utterances, and thus naturally augment the training examples without additional resources. However, one core challenge remains for segment-based approaches: most emotional corpora do not provide ground-truth labels at the segment level. To train a segment-based emotion model on such datasets in a supervised manner, the most common approach assigns each segment the emotion label of its utterance. However, this practice typically introduces noisy (incorrect) labels, as emotional information is not uniformly distributed across the whole utterance. On the other hand, DNNs have been shown to…
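The label-inheritance practice described in the abstract can be illustrated with a short sketch. The segment length, hop size, and helper names (`split_into_segments`, `inherit_labels`) are assumptions chosen for illustration; the abstract does not specify them.

```python
from typing import List, Tuple


def split_into_segments(num_frames: int, seg_len: int, hop: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of fixed-length segments."""
    if seg_len <= 0 or hop <= 0:
        raise ValueError("seg_len and hop must be positive")
    if num_frames < seg_len:
        # Keep an utterance shorter than one segment as a single segment.
        return [(0, num_frames)]
    return [(s, s + seg_len) for s in range(0, num_frames - seg_len + 1, hop)]


def inherit_labels(
    utterances: List[Tuple[int, str]], seg_len: int, hop: int
) -> List[Tuple[int, int, int, str]]:
    """Give every segment the label of the utterance it came from.

    Each utterance is (num_frames, label); each output row is
    (utterance_index, start_frame, end_frame, label).
    """
    rows = []
    for utt_id, (num_frames, label) in enumerate(utterances):
        for start, end in split_into_segments(num_frames, seg_len, hop):
            rows.append((utt_id, start, end, label))
    return rows


# Two toy utterances: 300 frames labeled "angry", 120 frames labeled "neutral".
utterances = [(300, "angry"), (120, "neutral")]
segments = inherit_labels(utterances, seg_len=100, hop=50)
```

Every segment of the first utterance is tagged "angry" here, even if only part of the utterance actually carries that emotion, which is the source of the label noise the abstract refers to.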
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Music and Audio Processing
Methods: Self-Learning
