TL;DR
This paper introduces an end-to-end neural network model that jointly optimizes speaker embedding extraction and backend scoring, significantly improving speaker verification accuracy over traditional x-vector PLDA systems.
Contribution
It presents an end-to-end training approach for speaker verification that jointly optimizes the embedding network (x-vector network) and the neural PLDA (NPLDA) scoring network, rather than training the two stages separately.
Findings
Significant accuracy improvements on NIST SRE 2018 and 2019 datasets.
End-to-end model outperforms traditional x-vector PLDA baseline.
Joint optimization leads to better verification scores.
Abstract
While deep learning models have made significant advances in supervised classification problems, the application of these models to out-of-set verification tasks like speaker recognition has been limited to deriving feature embeddings. State-of-the-art x-vector PLDA-based speaker verification systems use a generative model based on probabilistic linear discriminant analysis (PLDA) to compute the verification score. Recently, we proposed a neural network approach for backend modeling in speaker verification called the neural PLDA (NPLDA), in which the likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in…
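The idea of treating a PLDA-style likelihood ratio as a learnable similarity function trained with a verification cost can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the quadratic score form, the sigmoid-smoothed detection cost, and every name and default value here (`pairwise_quadratic_score`, `soft_detection_cost`, `P`, `Q`, `alpha`, `beta`, `threshold`) are hypothetical choices, since the abstract does not spell out these details.

```python
import numpy as np


def pairwise_quadratic_score(enroll, test, P, Q):
    """Quadratic similarity s = e'Qe + t'Qt + 2 e'Pt for each row pair.

    Assumption: P and Q are symmetric (d, d) matrices that would be the
    learnable parameters of the scoring layer.
    """
    e_term = np.einsum("nd,de,ne->n", enroll, Q, enroll)
    t_term = np.einsum("nd,de,ne->n", test, Q, test)
    cross = np.einsum("nd,de,ne->n", enroll, P, test)
    return e_term + t_term + 2.0 * cross


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def soft_detection_cost(scores, labels, threshold, alpha=10.0, beta=99.0):
    """Smooth surrogate of a detection cost: soft miss + beta * soft false alarm.

    labels: 1 for target trials, 0 for non-target trials.
    alpha (sigmoid sharpness) and beta (false-alarm weight) are assumed values.
    """
    labels = np.asarray(labels, dtype=float)
    soft_accept = sigmoid(alpha * (scores - threshold))
    p_miss = np.sum(labels * (1.0 - soft_accept)) / np.sum(labels)
    p_fa = np.sum((1.0 - labels) * soft_accept) / np.sum(1.0 - labels)
    return p_miss + beta * p_fa


# Toy trials: one matched pair and one mismatched pair of 2-D embeddings.
enroll = np.array([[1.0, 0.0], [1.0, 0.0]])
test = np.array([[1.0, 0.0], [0.0, 1.0]])
P = np.eye(2)          # cross-term weights (hypothetical values)
Q = np.zeros((2, 2))   # within-term weights (hypothetical values)

scores = pairwise_quadratic_score(enroll, test, P, Q)
cost = soft_detection_cost(scores, labels=[1, 0], threshold=1.0)
```

Because the hard accept/reject decision is replaced by a sigmoid, the cost is differentiable in the scores, which is what would let gradients flow from the verification objective back through the scoring parameters and, in a joint setup, into the embedding network. In practice this would be written in an autodiff framework; NumPy is used here only to keep the sketch self-contained.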
