Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment

Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, and Carlos Busso

arXiv:2407.17716 · cs.SD · November 11, 2025 · 1 citation

TL;DR

This paper introduces an environment-aware training method for speech emotion recognition (SER) that uses text descriptions of ambient noise to improve noise robustness, especially at low signal-to-noise ratios.

Contribution

The paper proposes a text-guided, environment-aware training framework in which a large language model encodes text descriptions of the environment, improving SER performance in noisy conditions.

Findings

01

Substantial performance improvements at -5 dB SNR, with a relative gain of up to 100% in dominance prediction.

02

Effective use of text-based environment embeddings to enhance noise robustness.

03

Jointly fine-tuning the text encoder and the emotion recognition model further improves results.
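
The percentage in finding 01 is easiest to interpret as a gain relative to a baseline score, which is an assumption here since this summary does not name the metric. A minimal sketch of that arithmetic follows; the function name `relative_gain_percent` and the two scores are hypothetical illustrations, not values reported in the paper.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Return the improvement of `improved` over `baseline` as a percentage.

    Hypothetical helper: a 100% relative gain means the score doubled,
    not that the system reached a perfect score.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative gain")
    return (improved - baseline) / abs(baseline) * 100.0


# Illustrative scores only (NOT results from the paper).
baseline_score = 0.25
improved_score = 0.50
gain = relative_gain_percent(baseline_score, improved_score)
print(f"relative gain: {gain:.1f}%")
```

Under this reading, a low baseline at -5 dB SNR can yield a large relative gain even when the absolute change in the score is modest.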

Abstract

Speech emotion recognition (SER) systems often struggle in real-world environments, where ambient noise severely degrades their performance. This paper explores a novel approach that exploits prior knowledge of testing environments to maximize SER performance under noisy conditions. To address this task, we propose a text-guided, environment-aware training where an SER model is trained with contaminated speech samples and their paired noise description. We use a pre-trained text encoder to extract the text-based environment embedding and then fuse it to a transformer-based SER model during training and inference. We demonstrate the effectiveness of our approach through our experiment with the MSP-Podcast corpus and real-world additive noise samples collected from the Freesound and DEMAND repositories. Our experiment indicates that the text-based environment descriptions processed by a…
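
The abstract describes training on contaminated speech paired with a noise description. A common way to build such a pair, sketched here under stated assumptions, is to scale an additive noise clip so the mixture reaches a target SNR and then store the text description next to the audio. The function names, the synthetic sine "speech", the Gaussian "noise", and the description string are all hypothetical stand-ins; the paper's actual data pipeline and fusion mechanism are not shown in this summary.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so that
    10 * log10(P_speech / P_scaled_noise) equals `snr_db`."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise clip is silent")
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR of a mixture when the clean speech is known."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins: one second of a 220 Hz tone and white noise at 16 kHz.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy = mix_at_snr(speech, noise, snr_db=-5.0)

# Hypothetical training pair: contaminated audio plus its environment text.
sample = {"audio": noisy, "description": "busy street with passing cars"}
print(f"measured SNR: {measured_snr_db(speech, noisy):.2f} dB")
```

At -5 dB the scaled noise carries roughly three times the power of the speech, which is why this condition is treated as the hardest case in the findings.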

Taxonomy

Topics: Speech Recognition and Synthesis