A Comprehensive Study on the Effectiveness of ASR Representations for Noise-Robust Speech Emotion Recognition
Xiaohan Shi, Jiajun He, Xingfeng Li, Tomoki Toda

TL;DR
This study explores the use of ASR model-derived features to improve noise-robust speech emotion recognition, demonstrating superior performance over traditional noise reduction, self-supervised, and text-based methods in noisy environments.
Contribution
Introduces a novel noisy speech emotion recognition (NSER) approach that uses intermediate ASR model features to handle real-world non-stationary noise.
Findings
Outperforms conventional noise reduction methods in NSER accuracy.
Surpasses self-supervised learning approaches on noisy speech emotion recognition.
Even exceeds text-based approaches that use ASR or ground-truth transcriptions.
Abstract
This paper proposes an efficient approach to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are of limited use against non-stationary noise in real-world environments because of its complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER that adopts an automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate-layer information from the ASR model as a feature representation for emotional speech and then apply this representation to the downstream NSER task. Our experimental results show that the proposed method 1) achieves better NSER performance than the conventional noise reduction method, 2) outperforms self-supervised…
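The pipeline the abstract describes (take an intermediate layer of an ASR model, pool it into an utterance-level feature, feed it to a downstream emotion classifier) can be sketched roughly as follows. This is only an illustrative toy, not the authors' implementation: `ToyEncoder` is a hypothetical stand-in with random frozen weights rather than a real pretrained ASR model, and the choice of layer index, mean pooling over time, a linear head, four emotion classes, and 80-dimensional input frames are all assumptions that do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


class ToyEncoder:
    """Hypothetical stand-in for a frozen, pretrained ASR encoder."""

    def __init__(self, feat_dim, hidden_dim, num_layers):
        dims = [feat_dim] + [hidden_dim] * num_layers
        # Random weights only; a real setup would load pretrained ASR weights.
        self.weights = [
            rng.standard_normal((dims[i], dims[i + 1])) / np.sqrt(dims[i])
            for i in range(num_layers)
        ]

    def forward(self, frames):
        """Return the output of every layer for a (time, feat_dim) input."""
        outputs = []
        hidden = frames
        for weight in self.weights:
            hidden = np.tanh(hidden @ weight)
            outputs.append(hidden)
        return outputs


def utterance_embedding(encoder, frames, layer):
    """Mean-pool one intermediate layer over time (pooling is an assumption)."""
    hidden = encoder.forward(frames)[layer]
    return hidden.mean(axis=0)


def emotion_logits(embedding, head_w, head_b):
    """Linear emotion classifier on top of the frozen features (assumed head)."""
    return embedding @ head_w + head_b


encoder = ToyEncoder(feat_dim=80, hidden_dim=32, num_layers=6)
frames = rng.standard_normal((120, 80))  # 120 frames standing in for noisy speech
emb = utterance_embedding(encoder, frames, layer=3)  # layer index is arbitrary here

head_w = rng.standard_normal((32, 4)) * 0.1  # 4 emotion classes (assumed)
head_b = np.zeros(4)
logits = emotion_logits(emb, head_w, head_b)
pred = int(np.argmax(logits))
```

In a real experiment the encoder would stay frozen and only the classifier head would be trained on labelled emotional speech; which intermediate layer works best is an empirical question this sketch does not answer.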
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
