Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment
Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, and Carlos Busso

TL;DR
This paper introduces a novel environment-aware training method for speech emotion recognition that leverages text descriptions of ambient noise to enhance noise robustness, especially under very noisy conditions.
Contribution
It proposes a text-guided, environment-aware training framework in which a large language model encodes text descriptions of the environment, and the resulting embeddings are fused with the SER model to improve performance in noisy environments.
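The fusion idea can be illustrated with a minimal sketch. It does not reproduce the paper's architecture: the fusion-by-concatenation strategy, the function names, the embedding size, and the example description are all assumptions made for illustration, and `toy_text_embedding` is a hypothetical stand-in for a real pre-trained text encoder.

```python
import hashlib


def toy_text_embedding(description, dim=4):
    """Hypothetical stand-in for a pre-trained text encoder.

    Maps a text description to `dim` deterministic floats in [0, 1).
    A real system would use a learned encoder instead.
    """
    digest = hashlib.sha256(description.encode("utf-8")).digest()
    return [digest[i] / 256 for i in range(dim)]


def fuse_by_concatenation(frames, env_embedding):
    """Append the same environment embedding to every acoustic frame.

    This is one assumed fusion strategy, not necessarily the one
    used in the paper.
    """
    return [list(frame) + list(env_embedding) for frame in frames]


# Two toy acoustic frames with three features each (made-up values).
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Made-up environment description for illustration.
env = toy_text_embedding("inside a busy cafe")

# Each fused frame now carries acoustic features plus environment context.
fused = fuse_by_concatenation(frames, env)
```

In this sketch every frame sees the same environment vector, so a downstream model could condition its predictions on the described noise condition at both training and inference time.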
Findings
Significant performance improvements at -5 dB SNR, with up to 100% accuracy gain in dominance detection.
Effective use of text-based environment embeddings to enhance noise robustness.
Joint fine-tuning of text encoder and emotion recognition model further boosts results.
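To make the -5 dB SNR condition concrete, the sketch below shows one common way to contaminate clean speech with additive noise at a chosen SNR. It is an illustrative assumption, not the paper's data pipeline: the function names, the toy sine-wave "speech", and the Gaussian "noise" are all hypothetical.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise is silent")
    # SNR_dB = 10 * log10(P_speech / P_noise)
    # => target P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals: a 220 Hz tone as "speech" and Gaussian samples as "noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# At -5 dB the scaled noise carries roughly 3.16 times the power of the speech.
noisy = mix_at_snr(speech, noise, snr_db=-5.0)
```

A negative SNR means the noise is more powerful than the speech itself, which is why this condition is described as very noisy.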
Abstract
Speech emotion recognition (SER) systems often struggle in real-world environments, where ambient noise severely degrades their performance. This paper explores a novel approach that exploits prior knowledge of testing environments to maximize SER performance under noisy conditions. To address this task, we propose a text-guided, environment-aware training where an SER model is trained with contaminated speech samples and their paired noise description. We use a pre-trained text encoder to extract the text-based environment embedding and then fuse it to a transformer-based SER model during training and inference. We demonstrate the effectiveness of our approach through our experiment with the MSP-Podcast corpus and real-world additive noise samples collected from the Freesound and DEMAND repositories. Our experiment indicates that the text-based environment descriptions processed by a…
Taxonomy
Topics: Speech Recognition and Synthesis
