Xi-Vector Embedding for Speaker Recognition
Kong Aik Lee, Qiongqiong Wang, Takafumi Koshinaka

TL;DR
This paper introduces the xi-vector, a Bayesian deep speaker embedding that incorporates uncertainty estimation, leading to significant improvements in speaker recognition accuracy over traditional x-vectors.
Contribution
It proposes a novel Bayesian extension to x-vectors with an auxiliary neural network for uncertainty prediction, enhancing speaker recognition performance.
Findings
Over 17.5% reduction in equal-error-rate
Over 10.9% reduction in detection cost
Consistent improvement across all operating points
Abstract
We present a Bayesian formulation for deep speaker embedding, wherein the xi-vector is the Bayesian counterpart of the x-vector, taking into account the uncertainty estimate. On the technology front, we offer a simple and straightforward extension to the now widely used x-vector. It consists of an auxiliary neural net predicting the frame-wise uncertainty of the input sequence. We show that the proposed extension leads to substantial improvement across all operating points, with a significant reduction in error rates and detection cost. On the theoretical front, our proposal integrates the Bayesian formulation of linear Gaussian model to speaker-embedding neural networks via the pooling layer. In one sense, our proposal integrates the Bayesian formulation of the i-vector to that of the x-vector. Hence, we refer to the embedding as the xi-vector, which is pronounced as /zai/ vector.…
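The pooling step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes diagonal Gaussian precisions per frame and a Gaussian prior, and the function name `posterior_pooling` and all argument names are hypothetical. Under a linear Gaussian model, each frame contributes a point estimate weighted by its predicted precision, so frames the auxiliary network marks as uncertain count for less in the utterance-level embedding.

```python
import numpy as np


def posterior_pooling(z, log_prec, prior_mean, prior_log_prec):
    """Precision-weighted pooling under a linear Gaussian model (sketch).

    z              : (T, D) frame-wise point estimates from the encoder
    log_prec       : (T, D) frame-wise log-precisions, assumed to come from
                     an auxiliary network (diagonal covariance assumption)
    prior_mean     : (D,) prior mean (assumed, could be learned)
    prior_log_prec : (D,) prior log-precision (assumed, could be learned)

    Returns the posterior mean and posterior precision, each of shape (D,).
    """
    prec = np.exp(log_prec)              # exponentiate to keep precisions positive
    prior_prec = np.exp(prior_log_prec)

    # Precisions add up across frames, plus the prior precision.
    post_prec = prec.sum(axis=0) + prior_prec

    # Posterior mean is the precision-weighted average of frames and prior.
    post_mean = ((prec * z).sum(axis=0) + prior_prec * prior_mean) / post_prec
    return post_mean, post_prec


# Toy usage: 5 frames, 3-dimensional frame features (random placeholder data).
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 3))
frame_log_prec = rng.normal(size=(5, 3))
mean, precision = posterior_pooling(frames, frame_log_prec, np.zeros(3), np.zeros(3))
```

When every frame has the same precision and the prior precision is negligible, the posterior mean reduces to plain average pooling over frames, which is one way to see this as an extension of the pooling used in x-vector systems.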
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
