Incorporating Uncertainty from Speaker Embedding Estimation to Speaker Verification
Qiongqiong Wang, Kong Aik Lee, Tianchi Liu

TL;DR
This paper enhances speaker verification by integrating uncertainty estimates from speaker embeddings into the scoring process, leading to significant improvements in verification accuracy across multiple datasets.
Contribution
It introduces a method to incorporate embedding uncertainty into PLDA scoring, including a new posterior covariance derivation and a length scaling technique, improving verification performance.
Findings
14.5%-41.3% EER reduction on VoxCeleb-1 and SITW datasets
Effective uncertainty propagation improves speaker verification accuracy
Significant reductions in minDCF across tested datasets
Table 1: EER (%) / minDCF of PLDA systems with x-vector (x) and xi-vector (xi) embeddings under no pre-processing (-), length normalization (LN), and length scaling (LS).

| Embedding | Vox1-O (-) | Vox1-O (LN) | Vox1-O (LS) | Vox1-H (-) | Vox1-H (LN) | Vox1-H (LS) | SITW (-) | SITW (LN) | SITW (LS) | CNCeleb (-) | CNCeleb (LN) | CNCeleb (LS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | 1.44/0.184 | 1.23/0.173 | 1.23/0.173 | 2.68/0.248 | 2.46/0.235 | 2.46/0.236 | 1.80/0.180 | 1.67/0.172 | 1.69/0.172 | 17.38/0.675 | 13.04/0.625 | 13.00/0.626 |
| xi | 1.49/0.166 | 1.23/0.143 | 1.23/0.141 | 2.76/0.247 | 2.44/0.236 | 2.43/0.235 | 1.86/0.175 | 1.65/0.170 | 1.65/0.170 | 17.32/0.675 | 12.64/0.629 | 12.63/0.630 |

Table 2: EER (%) / minDCF of xi-vector systems with different back-ends. Columns marked * use the mean of the training data for embedding centralization.

| Back-end | Vox1-O | Vox1-H | SITW | CNCeleb | SITW* | CNCeleb* |
|---|---|---|---|---|---|---|
| PLDA | 1.49/0.166 | 2.76/0.247 | 1.86/0.175 | 17.32/0.675 | 1.91/0.178 | 17.99/0.689 |
| LS PLDA | 1.23/0.141 | 2.43/0.235 | 1.65/0.170 | 12.63/0.630 | 1.77/0.168 | 15.04/0.676 |
| UP-PLDA | 0.99/0.129 | 2.13/0.231 | 1.77/0.167 | 10.81/0.652 | 1.80/0.170 | 13.18/0.714 |
| UP-LS UP-PLDA | 1.01/0.124 | 2.18/0.224 | 1.59/0.167 | 10.16/0.608 | 1.59/0.166 | 11.57/0.686 |
Abstract
Speech utterances recorded under differing conditions exhibit varying degrees of confidence in their embedding estimates, i.e., uncertainty, even if they are extracted using the same neural network. This paper aims to incorporate the uncertainty estimate produced in the xi-vector network front-end with a probabilistic linear discriminant analysis (PLDA) back-end scoring for speaker verification. To achieve this, we derive a posterior covariance matrix, which measures the uncertainty, from the frame-wise precisions to the embedding space. We propose a log-likelihood ratio function for the PLDA scoring with the uncertainty propagation. We also propose to replace the length normalization pre-processing technique with a length scaling technique for the application of uncertainty propagation in the back-end. Experimental results on the VoxCeleb-1 and SITW test sets, as well as on a domain-mismatched CNCeleb1-E set, show the effectiveness of the proposed techniques, with 14.5%–41.3% EER reductions together with reductions in minDCF.
Index Terms— speaker embeddings, PLDA, speaker verification, uncertainty, xi-vector
1 Introduction
Automatic speaker verification (ASV) is the process of verifying whether a given speech utterance is from a specific speaker. Recent ASV systems have benefited from deep learning by replacing individual components [1, 2, 3], as well as the entire pipeline in an end-to-end manner [4, 5]. Among these approaches, using DNNs in the front-end to extract discriminative speaker embeddings has been shown to be the most viable and effective. Speaker embeddings are fixed-length continuous-value representations of input sequences that contain acoustic feature vectors living in large, complex spaces. Using them simplifies the ASV task because probability distributions and geometric concepts are easily applicable in the embedding space. Therefore, speaker embeddings and a scoring back-end are often used together in ASV frameworks [6, 7, 8, 9, 10].
A substantial body of work has reported network architectures that produce embedding vectors with better speaker representations [1, 11, 12, 13]. Unlike the generative i-vector extraction model, in which a posterior mean and precision are calculated simultaneously [14], deep speaker embedding networks mostly produce only a point estimate, without a guarantee or a measure of its precision. Factors unrelated to a speaker’s voice, however, do affect the estimation of the embedding vector representing the speaker [15]. Speech utterances exhibit both extrinsic variability, due to background noise and channel distortions, and intrinsic variability, including the physiological nature of the vocal apparatus and psychological states. It has also been pointed out that shorter speech utterances produce less reliable embeddings and result in poor ASV performance [16]. Therefore, it is essential to measure the precision, or equivalently, the uncertainty of the embeddings.
Little work has been done on the uncertainty of deep speaker embeddings. For speaker diarization, in which long recordings are typically cut into very short segments for clustering, a neural network has been used to predict the uncertainty of an x-vector [17] from the output of the statistics pooling layer of the original x-vector extractor. The x-vector and its predicted uncertainty are then both input to an agglomerative hierarchical clustering (AHC) algorithm. The xi-vector network [12], with a posterior inference pooling, has been proposed to integrate the Bayesian formulation of a linear Gaussian model into speaker-embedding neural networks. The precisions, however, are used only in the calculation of the posterior mean vectors from which xi-vectors are obtained, and are discarded afterwards.
Probabilistic linear discriminant analysis (PLDA) is a promising back-end for deep speaker embeddings [18]. The ability to handle uncertainty has been the cornerstone in the successful use of PLDA generative models [14, 19, 20]. Thus, it is natural to propagate the uncertainty seen in the front-end to PLDA. In this paper, we aim to incorporate the uncertainty that is calculated during xi-vector extraction into the back-end PLDA scoring. Our contributions are:
i) we derive the formulation for PLDA scoring with uncertainty propagation, and
ii) we propose a length scaling technique to replace length normalization, so that the pre-processing can be applied to an uncertainty-propagated back-end.
The paper is organized as follows. Section 2 presents the PLDA with uncertainty propagation. Section 3 derives the uncertainty in xi-vector space and presents a length scaling technique. Section 4 describes our experimental setup, results, and analyses. Section 5 summarizes our work.
2 Speaker verification scoring with uncertainty
Speaker verification can be accomplished by calculating the similarity between the two speaker embeddings corresponding to an enrollment and a test utterance. PLDA is a supervised parametric scoring method that is widely used in speaker recognition [21, 22]. The propagation of speaker embeddings’ uncertainty in PLDA has been addressed previously. The i-vector posterior uncertainty, which is defined by the generative i-vector extraction model, has been propagated to PLDA to analyze the quality effects caused by duration and phonetic variability [23, 24]. In [17], probabilistic embeddings, which consist of x-vectors and precisions, have been proposed to work with PLDA. The explicit form of the PLDA scoring log-likelihood ratio function, however, has not been presented.
2.1 Uncertainty propagation in PLDA scoring (UP-PLDA)
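Let $\boldsymbol{\phi}_r$ be a $D$-dimensional speaker embedding vector of a speech utterance $r$ with a precision $\mathbf{L}_r$, which measures the uncertainty in the embedding extraction process. We assume that the vector is generated from a linear Gaussian model [25]

$$\boldsymbol{\phi}_r = \boldsymbol{\mu} + \mathbf{F}\mathbf{h} + \mathbf{U}_r\mathbf{x}_r + \boldsymbol{\epsilon}_r \tag{1}$$

with two latent variables: $\mathbf{h}$ as the speaker variable

$$\mathbf{h} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \tag{2}$$

and $\mathbf{x}_r$ as the variable that models the statistical noise in the embedding extraction process

$$\mathbf{x}_r \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \tag{3}$$

The vector $\boldsymbol{\mu}$ represents the global mean. The matrix $\mathbf{F}$ is the speaker loading matrix, $\mathbf{U}_r$ is the lower-triangular Cholesky factor of the covariance $\mathbf{L}_r^{-1} = \mathbf{U}_r\mathbf{U}_r^{\mathsf{T}}$, which represents the uncertainty of the vector $\boldsymbol{\phi}_r$, and $\boldsymbol{\epsilon}_r$ models the residual variances

$$\boldsymbol{\epsilon}_r \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}) \tag{4}$$

Integrating out the latent variables, we arrive at the following marginal density

$$p(\boldsymbol{\phi}_r) = \mathcal{N}\left(\boldsymbol{\phi}_r \mid \boldsymbol{\mu},\; \boldsymbol{\Phi}_{\mathrm{B}} + \boldsymbol{\Phi}_{\mathrm{W},r}\right) \tag{5}$$

where $\boldsymbol{\Phi}_{\mathrm{B}}$ is the between-speaker covariance matrix

$$\boldsymbol{\Phi}_{\mathrm{B}} = \mathbf{F}\mathbf{F}^{\mathsf{T}} \tag{6}$$

and $\boldsymbol{\Phi}_{\mathrm{W},r}$ is an utterance-dependent within-speaker covariance matrix

$$\boldsymbol{\Phi}_{\mathrm{W},r} = \mathbf{U}_r\mathbf{U}_r^{\mathsf{T}} + \boldsymbol{\Sigma} \tag{7}$$

The total covariance corresponding to the utterance is the sum of the two covariance matrices

$$\boldsymbol{\Phi}_{\mathrm{T},r} = \boldsymbol{\Phi}_{\mathrm{B}} + \boldsymbol{\Phi}_{\mathrm{W},r} \tag{8}$$

and thus is also utterance dependent.
In the testing phase, the log-likelihood ratio (LLR) between the enrollment ($\boldsymbol{\phi}_e$) and test ($\boldsymbol{\phi}_t$) embeddings is used to score how likely they are to be from the same speaker

$$\mathrm{LLR}(\boldsymbol{\phi}_e, \boldsymbol{\phi}_t) = \log \frac{p(\boldsymbol{\phi}_t \mid \boldsymbol{\phi}_e)}{p(\boldsymbol{\phi}_t)} \tag{9}$$

The probability density functions (pdfs) are assumed to be conditioned on the model parameters $\{\boldsymbol{\mu}, \mathbf{F}, \boldsymbol{\Sigma}\}$ and on the utterance-dependent uncertainties $\mathbf{U}_e$ and $\mathbf{U}_t$. The predictive distribution in the numerator is a normal pdf evaluated at $\boldsymbol{\phi}_t$ with mean and covariance equal to

$$\boldsymbol{\mu}_{t|e} = \boldsymbol{\mu} + \mathbf{F}\,\mathbb{E}[\mathbf{h} \mid \boldsymbol{\phi}_e], \qquad \boldsymbol{\Sigma}_{t|e} = \mathbf{F}\mathbf{P}_e^{-1}\mathbf{F}^{\mathsf{T}} + \boldsymbol{\Phi}_{\mathrm{W},t} \tag{10}$$

which requires estimating the posterior expectation and precision of the speaker vector

$$\mathbb{E}[\mathbf{h} \mid \boldsymbol{\phi}_e] = \mathbf{P}_e^{-1}\mathbf{F}^{\mathsf{T}}\boldsymbol{\Phi}_{\mathrm{W},e}^{-1}(\boldsymbol{\phi}_e - \boldsymbol{\mu}), \qquad \mathbf{P}_e = \mathbf{I} + \mathbf{F}^{\mathsf{T}}\boldsymbol{\Phi}_{\mathrm{W},e}^{-1}\mathbf{F} \tag{11}$$

With the relations shown in (6-7) and the Woodbury identity for the inversion of matrices [26], we use (11) in (10) and obtain another form that uses only the between-speaker and total covariance matrices

$$\boldsymbol{\mu}_{t|e} = \boldsymbol{\mu} + \boldsymbol{\Phi}_{\mathrm{B}}\boldsymbol{\Phi}_{\mathrm{T},e}^{-1}(\boldsymbol{\phi}_e - \boldsymbol{\mu}), \qquad \boldsymbol{\Sigma}_{t|e} = \boldsymbol{\Phi}_{\mathrm{T},t} - \boldsymbol{\Phi}_{\mathrm{B}}\boldsymbol{\Phi}_{\mathrm{T},e}^{-1}\boldsymbol{\Phi}_{\mathrm{B}} \tag{12}$$

The denominator is also a normal pdf evaluated at $\boldsymbol{\phi}_t$ with parameters

$$\boldsymbol{\mu}_{t} = \boldsymbol{\mu}, \qquad \boldsymbol{\Sigma}_{t} = \boldsymbol{\Phi}_{\mathrm{B}} + \boldsymbol{\Phi}_{\mathrm{W},t} = \boldsymbol{\Phi}_{\mathrm{T},t} \tag{13}$$

That is, the posterior mean and covariance of the speaker variable are set to their priors.
When the uncertainty of speaker embeddings is not considered, i.e., the covariance $\mathbf{L}_r^{-1}$ is assumed to be zero, (1) becomes the traditional PLDA, and all embeddings share the same between-speaker, within-speaker, and total covariance matrices $\boldsymbol{\Phi}_{\mathrm{B}}$, $\boldsymbol{\Phi}_{\mathrm{W}} = \boldsymbol{\Sigma}$, and $\boldsymbol{\Phi}_{\mathrm{T}} = \boldsymbol{\Phi}_{\mathrm{B}} + \boldsymbol{\Phi}_{\mathrm{W}}$. In the log-likelihood ratio function (9), the parameters of the predictive distribution in the numerator at $\boldsymbol{\phi}_t$ are

$$\boldsymbol{\mu}_{t|e} = \boldsymbol{\mu} + \boldsymbol{\Phi}_{\mathrm{B}}\boldsymbol{\Phi}_{\mathrm{T}}^{-1}(\boldsymbol{\phi}_e - \boldsymbol{\mu}), \qquad \boldsymbol{\Sigma}_{t|e} = \boldsymbol{\Phi}_{\mathrm{T}} - \boldsymbol{\Phi}_{\mathrm{B}}\boldsymbol{\Phi}_{\mathrm{T}}^{-1}\boldsymbol{\Phi}_{\mathrm{B}} \tag{14}$$

and in the denominator $\boldsymbol{\mu}$ and $\boldsymbol{\Phi}_{\mathrm{T}}$.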
Therefore, the PLDA scoring LLR function has the same form with or without uncertainty propagation. The difference is that, with uncertainty propagation, the within-speaker covariance in the LLR depends on the individual recording. The covariances are adapted according to the uncertainty of the enrollment or test xi-vector estimate, as shown in (12-13).
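The scoring rule of this section can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, argument names, and the use of dense covariance matrices are assumptions made here. The numerator uses the predictive form built from the between-speaker and total covariances, and the denominator uses the marginal with the test utterance's total covariance.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate normal N(mean, cov) evaluated at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)

def up_plda_llr(phi_e, phi_t, mu, Phi_B, Sigma, cov_e, cov_t):
    """LLR with uncertainty propagation (hypothetical helper).

    Phi_B        : between-speaker covariance
    Sigma        : residual (shared within-speaker) covariance
    cov_e, cov_t : uncertainty covariances of the enrollment / test embeddings
    """
    # Utterance-dependent total covariances: between + residual + uncertainty.
    Phi_T_e = Phi_B + Sigma + cov_e
    Phi_T_t = Phi_B + Sigma + cov_t
    # Predictive distribution of the test embedding given the enrollment one.
    gain = Phi_B @ np.linalg.inv(Phi_T_e)
    pred_mean = mu + gain @ (phi_e - mu)
    pred_cov = Phi_T_t - gain @ Phi_B
    # Numerator: same-speaker predictive; denominator: marginal of the test.
    return (gaussian_logpdf(phi_t, pred_mean, pred_cov)
            - gaussian_logpdf(phi_t, mu, Phi_T_t))
```

Passing zero matrices for `cov_e` and `cov_t` reduces the function to conventional PLDA scoring, since both total covariances then collapse to the shared one.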
3 Derive embedding uncertainty from network
3.1 Posterior inference in speaker embedding networks
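The xi-vector [12] was proposed to include a posterior inference pooling that integrates the Bayesian formulation of a linear Gaussian model into speaker-embedding neural networks. It assumes that a linear Gaussian model is responsible for generating the frame-wise representations $\mathbf{z}_t$ and characterizes the frame uncertainty with a covariance matrix $\mathbf{L}_t^{-1}$ associated with each estimate

$$\mathbf{z}_t = \mathbf{h} + \boldsymbol{\epsilon}_t \tag{15}$$

where $\mathbf{h}$ is a latent speaker variable with the prior mean vector $\boldsymbol{\mu}_p$ and covariance matrix $\mathbf{L}_p^{-1}$

$$\mathbf{h} \sim \mathcal{N}(\boldsymbol{\mu}_p, \mathbf{L}_p^{-1}) \tag{16}$$

and $\boldsymbol{\epsilon}_t$ represents the uncertainty for the frame $t$

$$\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{L}_t^{-1}) \tag{17}$$

Given the input sequence and the uncertainty estimates, the posterior distribution of the latent variable is also Gaussian

$$p(\mathbf{h} \mid \mathbf{z}_1, \dots, \mathbf{z}_T, \mathbf{L}_1^{-1}, \dots, \mathbf{L}_T^{-1}) = \mathcal{N}(\mathbf{h} \mid \boldsymbol{\phi}, \mathbf{L}^{-1}) \tag{18}$$

with a posterior mean

$$\boldsymbol{\phi} = \mathbf{L}^{-1}\left(\sum_{t=1}^{T}\mathbf{L}_t\mathbf{z}_t + \mathbf{L}_p\boldsymbol{\mu}_p\right) \tag{19}$$

and a precision matrix

$$\mathbf{L} = \sum_{t=1}^{T}\mathbf{L}_t + \mathbf{L}_p \tag{20}$$

Let $t = 1, \dots, T$ index the frames, and let the index $t = 0$ represent the prior such that $\mathbf{z}_0 = \boldsymbol{\mu}_p$ and $\mathbf{L}_0 = \mathbf{L}_p$. The posteriors can then be written as

$$\boldsymbol{\phi} = \mathbf{L}^{-1}\sum_{t=0}^{T}\mathbf{L}_t\mathbf{z}_t, \qquad \mathbf{L} = \sum_{t=0}^{T}\mathbf{L}_t \tag{21}$$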
After the posterior mean is obtained in the Gaussian posterior inference layer (see Fig 1), it is followed by a batch normalization layer (BN) and a fully connected layer (FC1) from which xi-vectors are extracted.
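The posterior inference pooling described above amounts to a precision-weighted average of the frame-wise representations together with the prior. A minimal NumPy sketch for the diagonal case is given below. The function name, the argument names, and the log-precision parameterization (used here only to keep precisions positive) are assumptions for illustration and are not taken from the xi-vector implementation.

```python
import numpy as np

def posterior_inference_pooling(z, log_prec, mu_p, log_prec_p):
    """Gaussian posterior inference pooling with diagonal precisions.

    z          : (T, D) frame-wise representations
    log_prec   : (T, D) frame-wise log-precisions (assumed parameterization)
    mu_p       : (D,) prior mean
    log_prec_p : (D,) prior log-precision
    Returns the posterior mean (D,) and the posterior precision (D,).
    """
    L_t = np.exp(log_prec)          # frame precisions, always positive
    L_p = np.exp(log_prec_p)        # prior precision
    # Posterior precision: sum of frame precisions plus the prior precision.
    L = L_t.sum(axis=0) + L_p
    # Posterior mean: precision-weighted average of frames and prior mean.
    phi = (np.sum(L_t * z, axis=0) + L_p * mu_p) / L
    return phi, L
```

Frames with a low predicted precision contribute little to the pooled mean, and the returned precision grows with the number of confident frames, which is consistent with longer utterances giving more reliable estimates.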
3.2 Derive xi-vector uncertainty
The posterior precision can be considered a measure of the uncertainty associated with the point estimate of the latent speaker variable. It is used only in the mean estimation in (19), so it affects the extracted xi-vector indirectly. It is, however, possible to propagate it to the xi-vector space, so that the embedding estimator yields an uncertainty measure in addition to a point estimate. In this way, the embedding is extended to a distribution with a mean and a variance. To obtain this, we propagate the posterior covariance through the same layers accordingly (see Fig 1). The details are shown in Algorithm 1.
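As an illustration of how a covariance can be carried through the layers that follow the pooling, the sketch below applies the general rule that an affine map $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ turns a covariance $\mathbf{C}$ into $\mathbf{A}\mathbf{C}\mathbf{A}^{\mathsf{T}}$. It assumes that the BN layer is in inference mode (a fixed per-dimension scale and shift) and that FC1 is a linear layer with no non-linearity before the embedding is taken. The function and argument names are hypothetical, and the sketch is not a reproduction of Algorithm 1.

```python
import numpy as np

def propagate_covariance(post_prec, bn_gamma, bn_running_var, fc_weight,
                         bn_eps=1e-5):
    """Propagate a diagonal posterior covariance through BN and a linear layer.

    post_prec      : (D_in,) diagonal posterior precision from the pooling
    bn_gamma       : (D_in,) BN scale parameters
    bn_running_var : (D_in,) BN running variances (inference mode)
    fc_weight      : (D_out, D_in) weight matrix of the linear layer
    Returns the (D_out, D_out) covariance in the embedding space.
    """
    post_var = 1.0 / post_prec                        # diagonal covariance
    # BN at inference is y = scale * x + shift; shifts do not affect covariance.
    scale = bn_gamma / np.sqrt(bn_running_var + bn_eps)
    var_bn = scale ** 2 * post_var
    # Linear layer: W diag(var_bn) W^T (biases do not affect covariance).
    return (fc_weight * var_bn) @ fc_weight.T
```

Even when the input covariance is diagonal, the output is generally a full matrix, so a diagonal approximation would have to be taken explicitly if the back-end requires one.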
3.3 Embedding length scaling (LS)
Length normalization (LN) together with whitening is a popular pre-processing technique to reinforce speaker embeddings to be more Gaussian distributed [27]. It would be hard, however, to apply it to posterior covariance matrices because of its non-linearity. Therefore, we refine LN into an embedding length scaling (LS) technique
[TABLE]
With the multiplication with in the numerator to preserve the scaling of the original embedding space, we only need to apply LS to the enrollment and test embeddings in evaluations, not to PLDA training embeddings. Note that when a total covariance matrix from the training data is used as , LS would be equivalent to LN but without retraining PLDA models. When the uncertainty of the embeddings is available, it can be propagated to LS (UP-LS) as well by including the embedding posterior covariance
[TABLE]
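One plausible reading of length scaling is sketched below. It assumes that the embedding is rescaled so that its squared Mahalanobis norm under the chosen covariance equals the embedding dimension $D$, i.e. the multiplier is $\sqrt{D}$ divided by the Mahalanobis length. Both the $\sqrt{D}$ factor and the exact form of the denominator are assumptions of this sketch, as are the function and argument names. For UP-LS, the embedding's posterior covariance is simply added to the total covariance.

```python
import numpy as np

def length_scale(phi, total_cov, post_cov=None):
    """Scale an embedding to Mahalanobis length sqrt(D) (assumed form of LS).

    phi       : (D,) centered embedding
    total_cov : (D, D) total covariance of the training data
    post_cov  : optional (D, D) posterior covariance of this embedding (UP-LS)
    """
    D = phi.shape[0]
    cov = total_cov if post_cov is None else total_cov + post_cov
    # Squared Mahalanobis length of the embedding under the chosen covariance.
    maha = phi @ np.linalg.solve(cov, phi)
    # Rescale only the length; the direction in the original space is kept.
    return np.sqrt(D / maha) * phi
```

Under this form, whitening the scaled embedding and dividing by $\sqrt{D}$ gives the same unit vector as whitening followed by length normalization, while the embedding itself stays in the original space so that a PLDA model trained on raw embeddings can be reused.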
4 Experiments
4.1 Experimental settings
The experiments were conducted on the VoxCeleb1 dataset [28], the Speakers in the Wild (SITW) core-core eval set [29], and the CNCeleb1 dataset [30]. For VoxCeleb1, we used the original test set Vox1-O and the hard test set Vox1-H. The front-end networks were trained using the original segments of the VoxCeleb2 dataset [31] with augmentations following the settings in [18]. The same training dataset without augmentation was used to train the back-ends after all VoxCeleb2 segments belonging to the same session were concatenated. For the SITW and CNCeleb evaluations, the SITW core-core dev set and the CNCeleb1-T dataset were used, respectively, as the development sets for mean normalization.
We used x-vector and xi-vector networks with an ECAPA-TDNN [11] backbone and optimized them with AAM-Softmax cross-entropy loss [32]. We used their standard forms. In the x-vector network, both weighted means and standard deviations from an attentive statistics pooling layer were propagated, while in xi-vector network only the posterior mean from the Gaussian posterior inference layer was propagated. Both types of embeddings have 192 dimensions. The posterior covariance matrices in the xi-vector network were assumed diagonal [12], and the prior mean and covariance were initialized as . In the testing phase , we obtained the corresponding posterior covariance following Algorithm 1.
For the back-end, the advantage of using a diagonalized within-speaker covariance matrix in PLDA has been shown in [18]. We further diagonalized the between-speaker covariance matrix for computational efficiency. We used the total covariance of the training data in LS for the PLDA back-ends, and that plus the testing utterance’s posterior covariance in UP-LS for the UP-PLDA back-end. Raw embeddings were used in PLDA training in the systems where LS was applied to the testing data. We used the SpeechBrain open-source toolkit [33] for the front-end implementations and embedding extractions. The inputs to the neural networks were 80-dimensional filter-bank features. Results are reported in terms of equal error rate (EER) and the minimum normalized detection cost function (minDCF) at and .
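For reference, the EER is the operating point at which the miss rate and the false-alarm rate are equal. A minimal sketch of how it can be estimated from a list of trial scores is shown below. The function name is hypothetical, the threshold sweep is the simplest possible one (no interpolation, ties between scores not handled specially), and this is not the scoring tool used in the paper.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER (as a fraction) from target and non-target scores."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]       # sort trials by ascending score
    # Sweep the threshold upwards past each sorted score in turn.
    # Miss rate: fraction of target trials at or below the threshold.
    fnr = np.concatenate([[0.0], np.cumsum(labels) / labels.sum()])
    # False-alarm rate: fraction of non-target trials above the threshold.
    fpr = np.concatenate([[1.0],
                          1.0 - np.cumsum(1 - labels) / (1 - labels).sum()])
    idx = np.argmin(np.abs(fnr - fpr))        # closest crossing point
    return 0.5 * (fnr[idx] + fpr[idx])
```

Multiplying the returned value by 100 gives the percentage form used in the tables.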
4.2 Experimental results
We first compare the LS and LN pre-processing techniques in both x-vector and xi-vector PLDA systems. As shown in Tab. 1, LN and LS give almost the same results, and both are consistently better than using no pre-processing. This supports the equivalence between LN and LS when the total covariance matrix is used, as mentioned in Section 3.3. Thus, we next apply LS in the back-end of PLDA with uncertainty. The advantage of xi-vectors over x-vectors is not obvious. We argue that features with different time-scales derived from the dense connection in ECAPA-TDNN may confuse the xi-vector posterior inference process.
Next we evaluate the uncertainty propagation in LS pre-processing and in PLDA, referred to as UP-LS and UP-PLDA, respectively. Table 2 shows a significant improvement due to the explicit use of xi-vector uncertainty in UP-PLDA scoring (line 3) over the system without it (line 1). Such observations are consistent in all four evaluation sets. This indicates that the explicit use of uncertainty in the back-end is more effective than its implicit use in xi-vector estimation. For the Vox1 and SITW evaluations, UP-PLDA yields a greater improvement in both EER and minDCF than LS does, while for the CNCeleb evaluation, UP-PLDA gives a better EER but a slight increase in minDCF. Further application of UP-LS pre-processing to xi-vectors for the UP-PLDA back-end improved ASV performance on the SITW and CNCeleb evaluation sets, in which certain domain mismatches exist. Overall, the use of LS pre-processing and uncertainty propagation in PLDA together achieved the best performance, with 14.5%–41.3% reductions in EER together with reductions in minDCF. For the SITW and CNCeleb sets, we also show in Tab 2 the performance using the mean of the training data for the embedding centralization. The comparison shows a clear advantage of using the mean adaptation in domain-mismatched evaluations.
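The relative EER reductions quoted above can be checked directly against Table 2 by comparing the PLDA baseline row with the UP-LS UP-PLDA row. The short script below does this arithmetic; the dictionary layout and function name are illustrative only, and the numbers are copied from the first four columns of Table 2.

```python
# (EER %, minDCF) copied from Table 2: PLDA baseline vs. UP-LS UP-PLDA.
baseline = {"Vox1-O": (1.49, 0.166), "Vox1-H": (2.76, 0.247),
            "SITW": (1.86, 0.175), "CNCeleb": (17.32, 0.675)}
proposed = {"Vox1-O": (1.01, 0.124), "Vox1-H": (2.18, 0.224),
            "SITW": (1.59, 0.167), "CNCeleb": (10.16, 0.608)}

def relative_reduction(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before

eer_reduction = {name: round(relative_reduction(baseline[name][0],
                                                proposed[name][0]), 1)
                 for name in baseline}
dcf_reduction = {name: round(relative_reduction(baseline[name][1],
                                                proposed[name][1]), 1)
                 for name in baseline}

print("EER reduction (%):", eer_reduction)
print("minDCF reduction (%):", dcf_reduction)
# The smallest and largest EER reductions are 14.5 (SITW) and 41.3 (CNCeleb).
```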
The CNCeleb evaluations give large values in EER and minDCF in all the systems (see Tab 1 and Tab 2) due to its severe conditions. We next investigate the UP-PLDA and LS effects in each genre, as shown in Fig 2. Despite the difference in language and speaking styles between CNCeleb set and the training data VoxCeleb, the observations of the improvement due to the use of UP-PLDA and LS pre-processing, in most of the genres, are consistent with the overall results in Tab 2.
5 Summary
This paper has revisited uncertainty propagation in PLDA. Based on the xi-vector framework, we derive a posterior covariance matrix from frame-wise precisions to measure the uncertainty of speaker embeddings. We propose a log-likelihood ratio function for PLDA scoring with the propagation of embedding uncertainty. Finally, we propose to replace the length normalization pre-processing technique with a length scaling technique for the application of uncertainty propagation in the back-end. Experimental results on the VoxCeleb-1 and SITW core-core eval sets, as well as on the domain-mismatched CNCeleb1 set, show the effectiveness of the two techniques, with 14.5%–41.3% reductions in EER together with reductions in minDCF. In future work, we will investigate networks with different backbones and the use of uncertainty in domain adaptation.
6 Acknowledgements
This project is supported by the Agency for Science, Technology and Research (A⋆STAR), Singapore, through its Council Research Fund (Project No. CR-2021-005).
References
- [1] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. IEEE ICASSP, 2018, pp. 5329–5333.
- [2] Y. Tang, G. Ding, J. Huang, X. He, and B. Zhou, “Deep speaker embedding learning with multi-level pooling for text-independent speaker verification,” in Proc. IEEE ICASSP, 2019, pp. 6116–6120.
- [3] J. Chien and C. Hsu, “Variational manifold learning for speaker recognition,” in Proc. IEEE ICASSP, 2017, pp. 4935–4939.
- [4] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv:1705.02304, 2017.
- [5] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matejka, L. Burget, and O. Glembek, “End-to-end DNN based text-independent speaker recognition for long and short utterances,” in Computer Speech & Language, 2020, pp. 22–35.
- [6] K. A. Lee, V. Hautamaki, T. Kinnunen, H. Yamamoto, K. Okabe, et al., “I4U submission to NIST SRE 2018: Leveraging from a decade of shared experiences,” in Proc. Interspeech, 2019, pp. 1497–1501.
- [7] P. Matejka, O. Plchot, O. Glembek, L. Burget, J. Rohdin, et al., “13 years of speaker recognition research at BUT, with longitudinal analysis of NIST SRE,” in Computer Speech & Language, 2020, vol. 63, p. 101035.
- [8] J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, et al., “State-of-the-art speaker recognition with neural network embeddings in NIST SRE 18 and Speakers in the Wild evaluations,” in Computer Speech & Language, 2020, vol. 60, p. 101026.
