Xi-Vector Embedding for Speaker Recognition

Kong Aik Lee; Qiongqiong Wang; Takafumi Koshinaka

arXiv:2108.05679 · eess.AS · August 13, 2021

Open Access

TL;DR

This paper introduces the xi-vector, a Bayesian deep speaker embedding that incorporates uncertainty estimation, leading to significant improvements in speaker recognition accuracy over traditional x-vectors.

Contribution

It proposes a novel Bayesian extension to x-vectors with an auxiliary neural network for uncertainty prediction, enhancing speaker recognition performance.

Findings

01. Over 17.5% reduction in equal-error-rate
02. Over 10.9% reduction in detection cost
03. Consistent improvement across all operating points

Abstract

We present a Bayesian formulation for deep speaker embedding, wherein the xi-vector is the Bayesian counterpart of the x-vector, taking into account the uncertainty estimate. On the technology front, we offer a simple and straightforward extension to the now widely used x-vector. It consists of an auxiliary neural net predicting the frame-wise uncertainty of the input sequence. We show that the proposed extension leads to substantial improvement across all operating points, with a significant reduction in error rates and detection cost. On the theoretical front, our proposal integrates the Bayesian formulation of the linear Gaussian model into speaker-embedding neural networks via the pooling layer. In one sense, our proposal integrates the Bayesian formulation of the i-vector into that of the x-vector. Hence, we refer to the embedding as the xi-vector, which is pronounced as /zai/ vector.…
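The pooling idea in the abstract can be illustrated with a minimal sketch. It assumes diagonal precisions and a standard linear Gaussian posterior: each frame contributes a point estimate and a precision, and these are combined with a Gaussian prior. The function name, argument names, and the random toy inputs below are hypothetical, chosen only for illustration; the sketch does not reproduce the authors' network or training setup.

```python
import numpy as np

def gaussian_posterior_pooling(z, log_prec, prior_mean, prior_log_prec):
    """Precision-weighted pooling of frame-wise estimates (illustrative only).

    z              : (T, D) frame-wise point estimates
    log_prec       : (T, D) frame-wise log-precisions (the uncertainty output)
    prior_mean     : (D,)   mean of the Gaussian prior
    prior_log_prec : (D,)   log-precision of the Gaussian prior
    Returns the posterior mean and posterior variance, each of shape (D,).
    """
    frame_prec = np.exp(log_prec)            # exponentiate to keep precisions positive
    prior_prec = np.exp(prior_log_prec)
    # Posterior precision: prior precision plus the sum of frame precisions.
    post_prec = prior_prec + frame_prec.sum(axis=0)
    # Posterior mean: precision-weighted average of the prior mean and the frames.
    weighted_sum = prior_prec * prior_mean + (frame_prec * z).sum(axis=0)
    post_mean = weighted_sum / post_prec
    return post_mean, 1.0 / post_prec

# Toy input: 200 frames of 4-dimensional features (random stand-ins).
rng = np.random.default_rng(0)
T, D = 200, 4
z = rng.normal(size=(T, D))
log_prec = rng.normal(size=(T, D))
prior_mean = np.zeros(D)
prior_log_prec = np.zeros(D)

mean, var = gaussian_posterior_pooling(z, log_prec, prior_mean, prior_log_prec)
```

In this sketch, frames with higher predicted precision pull the pooled mean more strongly. With equal frame precisions and a very weak prior, the result reduces to plain average pooling of the frames.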

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing