TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition
Chengxin Chen, Pengyuan Zhang

TL;DR
TRNet is a two-level refinement network that uses a pre-trained speech enhancement module to make speech emotion recognition more robust to noise, without degrading performance in noise-free conditions.
Contribution
Introduces TRNet, a two-level refinement approach combining speech enhancement and deep representation refinement for robust SER in noisy environments.
Findings
Significantly improves SER accuracy in noisy conditions
Maintains performance in noise-free environments
Effective in both matched and unmatched noise scenarios
Abstract
One persistent challenge in Speech Emotion Recognition (SER) is ubiquitous environmental noise, which frequently degrades SER performance in practice. In this paper, we introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge. Specifically, a pre-trained speech enhancement module is employed for front-end noise reduction and noise level estimation. We then use clean speech spectrograms and their corresponding deep representations as reference signals to refine the spectrogram distortion and representation shift of enhanced speech during model training. Experimental results validate that TRNet substantially improves the robustness of the system in both matched and unmatched noisy environments, without compromising its performance in noise-free environments.
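The two refinement levels described in the abstract can be read as two auxiliary penalties added to the emotion classification loss during training. The sketch below is only an illustration of that reading, not the authors' implementation: the function name `two_level_refinement_loss`, the weights `alpha` and `beta`, and the choice of mean absolute error for the spectrogram term and mean squared error for the representation term are all assumptions. The noise level estimate is left out because the abstract does not say how it is used.

```python
import numpy as np

def two_level_refinement_loss(enh_spec, clean_spec, enh_rep, clean_rep,
                              logits, label, alpha=1.0, beta=1.0):
    """Hypothetical combined objective; distances and weights are assumptions."""
    # Level 1: spectrogram distortion of the enhanced speech,
    # measured against the clean reference spectrogram (mean absolute error).
    spec_loss = np.mean(np.abs(enh_spec - clean_spec))

    # Level 2: representation shift of the enhanced speech,
    # measured against the clean deep representation (mean squared error).
    rep_loss = np.mean((enh_rep - clean_rep) ** 2)

    # Emotion classification: cross-entropy from a numerically stable log-softmax.
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    ce_loss = -log_probs[label]

    total = ce_loss + alpha * spec_loss + beta * rep_loss
    return total, (ce_loss, spec_loss, rep_loss)


# Toy usage with random arrays standing in for real features.
rng = np.random.default_rng(0)
clean_spec = rng.random((80, 100))           # (frequency bins, frames), assumed shape
enh_spec = clean_spec + 0.1 * rng.standard_normal((80, 100))
clean_rep = rng.random(256)                  # assumed embedding size
enh_rep = clean_rep + 0.1 * rng.standard_normal(256)
logits = rng.standard_normal(4)              # four emotion classes, assumed
total, parts = two_level_refinement_loss(enh_spec, clean_spec,
                                         enh_rep, clean_rep, logits, label=1)
```

Both auxiliary terms need the clean reference, so in this reading they apply only at training time. At inference only the enhanced speech and the classifier are used, which is consistent with the abstract's "during model training".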
