Comparison of Multiple Features and Modeling Methods for Text-dependent Speaker Verification
Yi Liu, Liang He, Yao Tian, Zhuzi Chen, Jia Liu, and Michael T. Johnson

TL;DR
This study compares four modeling methods for text-dependent speaker verification on the RedDots dataset, analyzing the impact of frame alignment algorithms and features, and finds that HMM-based models excel with fixed phrases, while bottleneck features are less effective in challenging scenarios.
Contribution
Introduces and compares four modeling methods for text-dependent speaker verification, analyzing the effects of frame alignment and features on performance.
Findings
HMM-based models perform well with fixed phrases.
Forward-backward algorithm benefits i-vector/HMM systems.
Bottleneck features do not outperform MFCCs in challenging trials.
Abstract
Text-dependent speaker verification is becoming popular in the speaker recognition community. However, the conventional i-vector framework, which has been successful for speaker identification and other similar tasks, works relatively poorly in this task. Researchers have proposed several new methods to improve performance, but it is still unclear which model is the best choice, especially when the pass-phrases are prompted during enrollment and test. In this paper, we introduce four modeling methods and compare their performance on the newly published RedDots dataset. To further explore the influence of different frame alignments, Viterbi and forward-backward algorithms are both used in the HMM-based models. Several bottleneck features are also investigated. Our experiments show that, by explicitly modeling the lexical content, the HMM-based modeling achieves good results in the…
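The difference between the two frame alignments named in the abstract can be illustrated with a toy example. The sketch below is not the paper's system: it assumes a hypothetical two-state HMM with made-up initial, transition, and per-frame emission probabilities, and uses only NumPy. Viterbi assigns each frame to exactly one state (a hard alignment), while forward-backward gives each frame a posterior distribution over states (a soft alignment), which is the kind of occupancy that statistics for an i-vector/HMM system could be weighted by.

```python
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """Hard alignment: the single most likely state for every frame."""
    T, S = log_b.shape
    delta = log_pi + log_b[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: state i -> state j
        backptr[t] = np.argmax(scores, axis=0)   # best predecessor of each state j
        delta = scores[backptr[t], np.arange(S)] + log_b[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):                # trace the best path backwards
        path[t - 1] = backptr[t, path[t]]
    return path

def forward_backward(log_b, log_A, log_pi):
    """Soft alignment: posterior probability gamma[t, s] of state s at frame t."""
    T, S = log_b.shape
    log_alpha = np.zeros((T, S))
    log_beta = np.zeros((T, S))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        log_alpha[t] = np.logaddexp.reduce(
            log_alpha[t - 1][:, None] + log_A, axis=0) + log_b[t]
    for t in range(T - 2, -1, -1):
        log_beta[t] = np.logaddexp.reduce(
            log_A + (log_b[t + 1] + log_beta[t + 1])[None, :], axis=1)
    log_gamma = log_alpha + log_beta
    log_gamma -= np.logaddexp.reduce(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma)

# Hypothetical toy model: all numbers below are illustrative assumptions.
pi = np.array([0.9, 0.1])                        # initial state probabilities
A = np.array([[0.7, 0.3],
              [0.1, 0.9]])                       # transition probabilities
# Per-frame emission likelihoods: first three frames favor state 0, last three state 1.
b = np.array([[0.99, 0.01]] * 3 + [[0.01, 0.99]] * 3)

path = viterbi(np.log(b), np.log(A), np.log(pi))
gamma = forward_backward(np.log(b), np.log(A), np.log(pi))

print("Viterbi path:", path)
print("State posteriors per frame:")
print(np.round(gamma, 4))
```

With hard alignment, each frame contributes to exactly one state's statistics; with soft alignment, a frame near a state boundary is shared between neighboring states in proportion to its posterior.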
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
