Analysis of Speech Separation Performance Degradation on Emotional Speech Mixtures
Jia Qi Yip, Dianwen Ng, Bin Ma, Chng Eng Siong

TL;DR
This paper investigates how emotional content in speech mixtures degrades the performance of speech separation models, highlighting the need to consider emotions for real-world applications.
Contribution
It introduces Emo2Mix, a test dataset balanced across all emotional combinations, and uses it to analyze how emotion affects separation performance, revealing significant degradation even in strong models.
Findings
Mixtures with strong emotions cause a loss of up to 5.1 dB SI-SDRi.
Even models with strong out-of-domain performance, such as Sepformer, degrade on emotional speech.
The analysis separates degradation caused by emotion from degradation caused by out-of-domain inference.
Abstract
Despite recent strides made in Speech Separation, most models are trained on datasets with neutral emotions. Emotional speech has been known to degrade performance of models in a variety of speech tasks, which reduces the effectiveness of these models when deployed in real-world scenarios. In this paper we perform analysis to differentiate the performance degradation arising from the emotions in speech from the impact of out-of-domain inference. This is measured using a carefully designed test dataset, Emo2Mix, consisting of balanced data across all emotional combinations. We show that even models with strong out-of-domain performance such as Sepformer can still suffer significant degradation of up to 5.1 dB SI-SDRi on mixtures with strong emotions. This demonstrates the importance of accounting for emotions in real-world speech separation applications.
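The degradation above is reported in SI-SDRi (scale-invariant signal-to-distortion ratio improvement). As a minimal sketch of how this metric is commonly computed, assuming the standard zero-mean SI-SDR definition and single-channel NumPy arrays of equal length, the snippet below compares the SI-SDR of a separated estimate against that of the unprocessed mixture. The function names `si_sdr` and `si_sdr_improvement` are illustrative only and are not taken from the paper or its evaluation code.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference."""
    # Remove the DC offset so the measure ignores constant bias.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the optimally scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything in the estimate that is not the scaled target counts as distortion.
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    # Synthetic stand-ins for two speakers (assumption: random noise, not real speech).
    rng = np.random.default_rng(0)
    speaker_a = rng.standard_normal(16000)
    speaker_b = rng.standard_normal(16000)
    mixture = speaker_a + speaker_b
    # A hypothetical separator output that suppresses most of speaker_b.
    estimate = speaker_a + 0.1 * speaker_b
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, speaker_a):.2f} dB")
```

Under this definition, a loss of 5.1 dB SI-SDRi means the model's gain over the raw mixture is 5.1 dB smaller on the affected mixtures than on the comparison condition.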
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
