A Comprehensive Study on the Effectiveness of ASR Representations for Noise-Robust Speech Emotion Recognition

Xiaohan Shi; Jiajun He; Xingfeng Li; Tomoki Toda

arXiv:2311.07093·cs.SD·January 13, 2026·1 cites

Open Access

TL;DR

This study explores the use of ASR model-derived features to improve noise-robust speech emotion recognition, demonstrating superior performance over traditional noise reduction, self-supervised, and text-based methods in noisy environments.

Contribution

Introduces a noisy speech emotion recognition (NSER) approach that uses intermediate-layer features from an ASR model to handle real-world non-stationary noise.

Findings

01 Outperforms the conventional noise reduction method in NSER performance.

02 Surpasses self-supervised learning approaches on noisy speech emotion recognition.

03 Exceeds text-based approaches, whether they use ASR output or ground-truth transcriptions.

Abstract

This paper proposes an efficient approach to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but they are limited in handling non-stationary noises in real-world environments because of the complexity and uncertainty of such noise. To overcome this limitation, we introduce a new method for NSER that adopts an automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate layer information from the ASR model as a feature representation for emotional speech and then apply this representation to the downstream NSER task. Our experimental results show that 1) the proposed method achieves better NSER performance than the conventional noise reduction method, 2) outperforms self-supervised…
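
The two-step pipeline in the abstract (take an intermediate layer of an ASR model as the feature representation, then pass it to a downstream emotion classifier) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the "encoder" is a stack of random tanh layers standing in for a real ASR model, and the layer index, mean pooling over time, linear classifier head, dimensions, and all names are hypothetical.

```python
import math
import random


def make_layer(dim, rng):
    """Return a toy frame-wise layer computing tanh(W x).

    Stand-in for one block of an ASR encoder (hypothetical, not a real model).
    """
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def layer(frames):
        return [
            [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]
            for frame in frames
        ]

    return layer


def encoder_hidden_states(layers, frames):
    """Run the encoder and keep the output of every layer."""
    states = []
    for layer in layers:
        frames = layer(frames)
        states.append(frames)
    return states


def mean_pool(frames):
    """Average over time to get one utterance-level vector (assumed pooling)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def classify(vector, class_weights):
    """Linear scores over emotion classes; returns the index of the top score."""
    scores = [sum(w * x for w, x in zip(row, vector)) for row in class_weights]
    return scores.index(max(scores))


rng = random.Random(0)
DIM, NUM_LAYERS, TAP_LAYER, NUM_EMOTIONS = 8, 6, 3, 4  # all hypothetical values

encoder = [make_layer(DIM, rng) for _ in range(NUM_LAYERS)]

# 50 frames of fake acoustic features standing in for a noisy utterance.
utterance = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]

states = encoder_hidden_states(encoder, utterance)
features = states[TAP_LAYER]  # output of the 4th layer: the intermediate representation
pooled = mean_pool(features)

# Untrained random head; in practice this part would be trained on emotion labels.
head = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EMOTIONS)]
predicted = classify(pooled, head)
```

With a real ASR model, the same data flow would typically read the chosen layer's hidden states from the model rather than from a toy stack; which layer works best is an empirical question that this sketch does not address.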

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing