Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances
Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, and Junichi Yamagishi

TL;DR
This paper introduces a novel end-to-end speaker verification model that jointly optimizes speaker embedding extraction and similarity scoring, effectively handling multiple enrollment utterances with attention mechanisms and data augmentation.
Contribution
It proposes a new joint end-to-end method specifically designed for multiple enrollment utterances, incorporating attention mechanisms and data augmentation for improved performance.
Findings
Effective handling of multiple enrollment utterances.
Enhanced speaker verification accuracy.
Robustness through data augmentation techniques.
Abstract
Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring. However, the sequential optimization of the front-end and back-end models may lead to a local minimum, which theoretically prevents the whole system from achieving the best optimization. Although some methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and NPLDA E2E model, all of these methods are designed for use with a single enrollment utterance. In this paper, we propose a new E2E joint method for speaker verification especially designed for the practical case of multiple enrollment…
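To make the multiple-enrollment setting concrete, here is a minimal sketch of how several enrollment embeddings might be pooled with attention and scored against one test embedding. It is an illustration under stated assumptions, not the paper's model: the excerpt above does not specify the attention mechanism, so the functions `attention_pool` and `verify`, the `scale` parameter, the cosine-similarity attention scores, and the `threshold` value are all hypothetical. The embeddings are assumed to be precomputed and fixed, whereas the proposed method trains the speaker encoder and the back-end jointly.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(enroll_embs, query, scale=1.0):
    """Pool enrollment embeddings into one vector (hypothetical scheme).

    Each enrollment embedding is weighted by a softmax over its scaled
    cosine similarity to the query (here, the test embedding).
    """
    weights = softmax([scale * cosine(e, query) for e in enroll_embs])
    dim = len(query)
    pooled = [
        sum(w * e[i] for w, e in zip(weights, enroll_embs))
        for i in range(dim)
    ]
    return pooled, weights


def verify(enroll_embs, test_emb, threshold=0.5):
    """Return (score, decision); the threshold is an arbitrary assumption."""
    pooled, _ = attention_pool(enroll_embs, test_emb)
    score = cosine(pooled, test_emb)
    return score, score >= threshold


# Toy 2-D embeddings standing in for real speaker embeddings.
enroll = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
target_score, target_accept = verify(enroll, [1.0, 0.0])
impostor_score, impostor_accept = verify(enroll, [-1.0, 0.0])
```

In this toy case the enrollment utterances that resemble the test embedding receive larger weights, so an outlying enrollment utterance contributes less to the pooled representation than it would under plain averaging.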
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Mixup
