Speaker Verification in Multi-Speaker Environments Using Temporal Feature Fusion
Ahmad Aloradi, Wolfgang Mack, Mohamed Elminshawi, Emanuël A. P. Habets

TL;DR
This paper introduces an end-to-end deep learning speaker verification system that detects whether a target speaker is present in a multi-speaker input by fusing a reference embedding with frame-level features, improving accuracy over fixed-embedding baselines such as x-vectors in multi-speaker conditions.
Contribution
The paper presents a novel temporal feature fusion approach for speaker verification in multi-speaker settings, addressing limitations of fixed-embedding methods.
Findings
Outperforms x-vector in multi-speaker conditions
Effective detection of target speakers in overlapping speech
Enhances robustness of speaker verification systems
Abstract
Verifying the identity of a speaker is crucial in modern human-machine interfaces, e.g., to ensure privacy protection or to enable biometric authentication. Classical speaker verification (SV) approaches estimate a fixed-dimensional embedding from a speech utterance that encodes the speaker's voice characteristics. A speaker is verified if his/her voice embedding is sufficiently similar to the embedding of the claimed speaker. However, such approaches assume that only a single speaker exists in the input. The presence of concurrent speakers is likely to have detrimental effects on the performance. To address SV in a multi-speaker environment, we propose an end-to-end deep learning-based SV system that detects whether the target speaker exists within an input or not. First, an embedding is estimated from a reference utterance to represent the target's characteristics. Second, frame-level…
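The two ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine score, the threshold value of 0.7, the use of concatenation as the fusion step, and all function names are assumptions chosen for illustration only. The first part shows classical verification by comparing two fixed-dimensional embeddings; the second shows one simple way a reference embedding could be attached to every frame-level feature vector before further processing.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify(test_embedding, claimed_embedding, threshold=0.7):
    """Classical SV decision: accept if the embeddings are sufficiently similar.

    The threshold is a hypothetical value; in practice it is tuned on held-out data.
    """
    return cosine_similarity(test_embedding, claimed_embedding) >= threshold


def fuse_with_frames(reference_embedding, frame_features):
    """Append the same reference embedding to every frame-level feature vector.

    Concatenation is an assumed fusion operator, used here only to show the
    shape of the data: T frames of size D become T frames of size D + E.
    """
    return [list(frame) + list(reference_embedding) for frame in frame_features]


if __name__ == "__main__":
    print(verify([1.0, 0.0], [1.0, 0.0]))
    print(verify([1.0, 0.0], [0.0, 1.0]))
    print(fuse_with_frames([9.0, 9.0], [[1.0, 2.0], [3.0, 4.0]]))
```

In the classical setting a single score is computed per utterance, which is why overlapping speakers are a problem: the test embedding mixes several voices. Attaching the reference to each frame instead lets a downstream model decide, frame by frame, whether the target speaker is present.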
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
