Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions

Jinyi Mi; Xiaohan Shi; Ding Ma; Jiajun He; Takuya Fujimura; Tomoki Toda

arXiv:2409.19585·cs.SD·December 18, 2024

TL;DR

This paper introduces a two-stage framework combining target speaker extraction and speech emotion recognition to improve robustness in noisy human speech environments, especially in gender-diverse mixtures.

Contribution

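The paper proposes a novel two-stage approach that cascades target speaker extraction (TSE) and SER, and explores joint training of the two models in the second stage, for robust SER under human speech noise, a condition most previous SER studies have not considered.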

Findings

01 Achieved 14.33% improvement in unweighted accuracy over baseline

02 Effective in different-gender speech mixtures

03 Joint training enhances system performance
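The first finding is reported in unweighted accuracy (UA). As a reminder of what that metric measures, here is a minimal Python sketch, assuming the usual SER convention that UA is the mean of per-class recalls, so every emotion class counts equally regardless of how many utterances it has. The function names and the toy labels are illustrative and do not come from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each emotion class contributes equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = defaultdict(int)    # utterances per true class
    correct = defaultdict(int)  # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: fraction of utterances classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels: an imbalanced set where the classifier always says "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

ua = unweighted_accuracy(y_true, y_pred)  # (8/8 + 0/2) / 2
wa = weighted_accuracy(y_true, y_pred)    # 8/10
```

On this toy set the plain accuracy looks high while UA exposes that one class is never recognized, which is why UA is a common choice for imbalanced emotion datasets.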

Abstract

Developing a robust speech emotion recognition (SER) system for noisy conditions is challenging because different noise types have different properties. Most previous studies have not considered the impact of human speech noise, which limits the application scope of SER. In this paper, we propose a novel two-stage framework for this problem that cascades a target speaker extraction (TSE) method and SER. We first train a TSE model to extract the speech of the target speaker from a mixture. Then, in the second stage, we use the extracted speech for SER training. Additionally, we explore joint training of the TSE and SER models in the second stage. Our system achieves a 14.33% improvement in unweighted accuracy (UA) over a baseline that does not use a TSE method, demonstrating the effectiveness of our framework in mitigating the impact of human speech noise. Moreover, we conduct experiments considering…
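The cascade described in the abstract can be pictured as a simple data flow: the mixture goes through the TSE stage, and only the extracted speech reaches the SER stage. The sketch below shows that flow at inference time only. It is an illustration under stated assumptions, not the authors' implementation: the enrollment argument (a reference clue identifying the target speaker, as is common in TSE systems) is not mentioned in the abstract, and `toy_extract` and `toy_classify` are hypothetical placeholders for trained models.

```python
from typing import Callable, Sequence

Waveform = Sequence[float]


def recognize_emotion(
    mixture: Waveform,
    enrollment: Waveform,
    extract: Callable[[Waveform, Waveform], Waveform],
    classify: Callable[[Waveform], str],
) -> str:
    """Two-stage cascade: stage 1 extracts the target speaker, stage 2 labels the emotion."""
    target_speech = extract(mixture, enrollment)  # stage 1: TSE
    return classify(target_speech)                # stage 2: SER on extracted speech


# Hypothetical stand-ins so the sketch runs; real stages would be trained networks.
def toy_extract(mixture: Waveform, enrollment: Waveform) -> Waveform:
    # Placeholder: ignores the enrollment and returns the mixture unchanged.
    return list(mixture)


def toy_classify(speech: Waveform) -> str:
    # Placeholder rule: louder input is labeled "angry", otherwise "neutral".
    mean_abs = sum(abs(x) for x in speech) / len(speech)
    return "angry" if mean_abs > 0.5 else "neutral"


label = recognize_emotion([0.9, -0.8, 0.7], [0.1, 0.2], toy_extract, toy_classify)
```

Passing the two stages as callables keeps them swappable, which mirrors the baseline comparison in the abstract: a system without TSE corresponds to skipping stage 1 and classifying the mixture directly.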

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis