VB calibration to improve the interface between phone recognizer and i-vector extractor
Niko Brümmer

TL;DR
This paper reinterprets the classical i-vector extractor as a mean-field variational Bayes (VB) method, shows that the same interpretation covers the newer phonetic i-vector extractor, and proposes posterior calibration to make existing extractors more accurate.
Contribution
It provides a VB-based theoretical framework for i-vector extractors and introduces posterior calibration methods to improve their performance.
Findings
VB interpretation unifies classical and phonetic i-vector extractors
Posterior calibration improves extractor accuracy
Proposed modifications lead to better model fit
Abstract
The EM training algorithm of the classical i-vector extractor is often incorrectly described as a maximum-likelihood method. The i-vector model is however intractable: the likelihood itself and the hidden-variable posteriors needed for the EM algorithm cannot be computed in closed form. We show here that the classical i-vector extractor recipe is actually a mean-field variational Bayes (VB) recipe. This theoretical VB interpretation turns out to be of further use, because it also offers an interpretation of the newer phonetic i-vector extractor recipe, thereby unifying the two flavours of extractor. More importantly, the VB interpretation is also practically useful: it suggests ways of modifying existing i-vector extractors to make them more accurate. In particular, in existing methods, the approximate VB posterior for the GMM states is fixed, while only the parameters of the…
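To make the mean-field reading concrete, the following is a minimal NumPy sketch of the i-vector factor update when the GMM state posteriors (responsibilities) are held fixed, which the abstract describes as the practice in existing methods. It uses the standard textbook i-vector posterior with a standard normal prior and diagonal covariances. The function name `ivector_posterior` and all argument names are hypothetical, and the code is not taken from the paper.

```python
import numpy as np


def ivector_posterior(X, Q, means, variances, T):
    """Gaussian posterior over the i-vector given FIXED state responsibilities.

    Assumed shapes (illustrative, not from the paper):
      X         (n_frames, D)  acoustic feature vectors
      Q         (n_frames, C)  fixed responsibilities, each row sums to 1
      means     (C, D)         GMM component means
      variances (C, D)         diagonal GMM covariances
      T         (C, D, R)      per-component factor loading matrices
    Returns the posterior mean (R,) and covariance (R, R).
    """
    # Zero-order statistics: soft frame counts per component.
    N = Q.sum(axis=0)
    # Centered first-order statistics per component.
    F = Q.T @ X - N[:, None] * means
    # Sigma_c^{-1} T_c for diagonal covariances.
    SinvT = T / variances[:, :, None]
    R = T.shape[2]
    # Posterior precision: I + sum_c N_c T_c^T Sigma_c^{-1} T_c.
    precision = np.eye(R) + np.einsum("c,cdr,cds->rs", N, SinvT, T)
    # Linear term: sum_c T_c^T Sigma_c^{-1} F_c.
    linear = np.einsum("cdr,cd->r", SinvT, F)
    cov = np.linalg.inv(precision)
    mean = cov @ linear
    return mean, cov
```

In this sketch `Q` could come from a GMM or from a phone recognizer. Only the i-vector factor is updated here, so any recalibration of `Q` would have to be a separate step.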
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Blind Source Separation Techniques
