Speaker Verification in Multi-Speaker Environments Using Temporal Feature Fusion

Ahmad Aloradi; Wolfgang Mack; Mohamed Elminshawi; Emanuël A. P. Habets

arXiv:2206.13808·eess.AS·June 29, 2022

Open Access

TL;DR

This paper introduces an end-to-end deep learning speaker verification system that detects whether a target speaker is present in a multi-speaker input by fusing a reference embedding with frame-level features; it outperforms an x-vector baseline in multi-speaker conditions.

Contribution

The paper presents a novel temporal feature fusion approach for speaker verification in multi-speaker settings, addressing limitations of fixed-embedding methods.
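To make the idea of combining a reference embedding with frame-level features concrete, here is a minimal sketch. It assumes the simplest possible fusion operator, concatenating the same reference embedding onto every frame; the text above does not say which operator the authors use, so this is an illustration and not their method. The function name `fuse_reference_with_frames` and the toy dimensions are hypothetical.

```python
def fuse_reference_with_frames(frame_features, reference_embedding):
    """Append the reference embedding to every frame-level feature vector.

    frame_features: list of T frames, each a list of D floats.
    reference_embedding: list of E floats derived from the reference utterance.
    Returns a list of T fused frames, each with D + E values.

    Assumption: concatenation is used here only as an illustrative fusion
    operator; the source does not specify the actual one.
    """
    reference = list(reference_embedding)
    return [list(frame) + reference for frame in frame_features]


# Toy input: 2 frames with 2 features each, and a 3-dimensional embedding.
frames = [[0.1, 0.2], [0.3, 0.4]]
reference = [1.0, 2.0, 3.0]

fused = fuse_reference_with_frames(frames, reference)
# Each fused frame now carries 2 + 3 = 5 values.
print(len(fused), len(fused[0]))
```

Because every frame sees the target's characteristics, a downstream network can make a per-frame judgment about the target's presence before pooling over time, instead of comparing two utterance-level vectors once.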

Findings

01 Outperforms x-vector in multi-speaker conditions

02 Effective detection of target speakers in overlapping speech

03 Enhances robustness of speaker verification systems

Abstract

Verifying the identity of a speaker is crucial in modern human-machine interfaces, e.g., to ensure privacy protection or to enable biometric authentication. Classical speaker verification (SV) approaches estimate a fixed-dimensional embedding from a speech utterance that encodes the speaker's voice characteristics. A speaker is verified if his/her voice embedding is sufficiently similar to the embedding of the claimed speaker. However, such approaches assume that only a single speaker exists in the input. The presence of concurrent speakers is likely to have detrimental effects on the performance. To address SV in a multi-speaker environment, we propose an end-to-end deep learning-based SV system that detects whether the target speaker exists within an input or not. First, an embedding is estimated from a reference utterance to represent the target's characteristics. Second, frame-level…
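The classical decision rule described in the abstract (accept the speaker if the two embeddings are sufficiently similar) can be sketched as follows. Cosine similarity and the threshold value of 0.7 are assumptions chosen for illustration; the abstract does not name a similarity measure or a threshold, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verify(test_embedding, claimed_embedding, threshold=0.7):
    """Accept the identity claim when the similarity reaches the threshold.

    Assumption: the threshold is a placeholder; in practice it would be
    tuned on held-out data.
    """
    return cosine_similarity(test_embedding, claimed_embedding) >= threshold


# Embeddings pointing in the same direction are accepted,
# orthogonal embeddings are rejected.
print(verify([1.0, 0.0], [2.0, 0.0]))
print(verify([1.0, 0.0], [0.0, 1.0]))
```

This rule compares one utterance-level vector per recording, which is why it implicitly assumes a single speaker in the input: a mixture of voices yields one embedding that may match none of the speakers well.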

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing