Three-Stage Speaker Verification Architecture in Emotional Talking Environments
Ismail Shahin, Ali Bou Nassif

TL;DR
This paper introduces a three-stage speaker verification system that incorporates gender and emotion recognition to improve accuracy in emotional talking environments, addressing the mismatch problem between training and testing conditions.
Contribution
The novel three-stage architecture effectively combines gender and emotion identification to enhance speaker verification in emotional environments, outperforming traditional methods.
Findings
The proposed framework achieves verification performance comparable to human listeners.
Incorporating emotion and gender information improves accuracy over single-factor methods.
Evaluations on two independent datasets validate the robustness of the approach.
Abstract
Speaker verification performance is usually high in neutral talking environments, but it degrades sharply in emotional talking environments. This degradation is caused by the mismatch between training in a neutral environment and testing in emotional environments. In this work, a three-stage speaker verification architecture is proposed to enhance speaker verification performance in emotional environments. The architecture comprises three cascaded stages: a gender identification stage, followed by an emotion identification stage, followed by a speaker verification stage. The proposed framework has been evaluated on two distinct and independent emotional speech datasets: an in-house dataset and the Emotional Prosody Speech and Transcripts dataset. Our results show that speaker verification based on both gender information and emotion…
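The cascade described in the abstract (gender identification, then emotion identification, then speaker verification) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the diagonal Gaussian models, the log-likelihood-ratio decision rule, the zero threshold, and all names (`DiagGaussian`, `argmax_label`, `verify`, the toy model tables) are hypothetical stand-ins for whatever classifiers and features the paper actually uses.

```python
import math
from dataclasses import dataclass


@dataclass
class DiagGaussian:
    """Hypothetical stand-in model: a diagonal-covariance Gaussian."""
    mean: list
    var: list

    def log_likelihood(self, x):
        # Sum of per-dimension Gaussian log densities.
        return sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, self.mean, self.var)
        )


def argmax_label(models, x):
    """Return the label whose model assigns x the highest log-likelihood."""
    return max(models, key=lambda label: models[label].log_likelihood(x))


def verify(x, claimed_speaker, gender_models, emotion_models,
           speaker_models, background_models, threshold=0.0):
    # Stage 1: identify the gender of the unknown utterance.
    gender = argmax_label(gender_models, x)
    # Stage 2: identify the emotion using models for that gender only.
    emotion = argmax_label(emotion_models[gender], x)
    # Stage 3 (assumed rule): log-likelihood ratio between the claimed
    # speaker's model for that emotion and a matching background model.
    claimed = speaker_models[(claimed_speaker, emotion)]
    background = background_models[(gender, emotion)]
    score = claimed.log_likelihood(x) - background.log_likelihood(x)
    return gender, emotion, score, score >= threshold


# Toy 2-D "features"; every number here is made up for illustration.
gender_models = {
    "male": DiagGaussian([0.0, 0.0], [1.0, 1.0]),
    "female": DiagGaussian([5.0, 5.0], [1.0, 1.0]),
}
emotion_models = {
    "male": {
        "neutral": DiagGaussian([0.0, 0.0], [1.0, 1.0]),
        "angry": DiagGaussian([0.0, 2.0], [1.0, 1.0]),
    },
    "female": {
        "neutral": DiagGaussian([5.0, 5.0], [1.0, 1.0]),
        "angry": DiagGaussian([5.0, 7.0], [1.0, 1.0]),
    },
}
speaker_models = {("spk1", "angry"): DiagGaussian([0.2, 2.1], [1.0, 1.0])}
background_models = {("male", "angry"): DiagGaussian([0.0, 2.0], [4.0, 4.0])}

gender, emotion, score, accepted = verify(
    [0.2, 2.0], "spk1",
    gender_models, emotion_models, speaker_models, background_models,
)
print(gender, emotion, round(score, 3), accepted)
```

The point of the cascade is that stage 3 compares the test utterance against a model matched to the recognized gender and emotion, which is one way to reduce the neutral-training versus emotional-testing mismatch the abstract describes. An error in stage 1 or 2 propagates to stage 3 in this sketch.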
