Exploring Voice Conversion based Data Augmentation in Text-Dependent Speaker Verification
Xiaoyi Qin, Yaogen Yang, Lin Yang, Xuyang Wang, Junjie Wang, Ming Li

TL;DR
This paper investigates voice conversion techniques for data augmentation to enhance text-dependent speaker verification, demonstrating significant performance improvements with limited training data.
Contribution
It introduces the use of voice conversion methods for data augmentation in speaker verification, showing their effectiveness over simple re-sampling.
Findings
Equal Error Rate (EER) drops from 6.51% to 4.51% in the limited-training-data scenario.
Voice conversion-based augmentation improves verification accuracy.
Simple re-sampling is less effective than voice conversion methods.
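To make the headline metric concrete, here is a minimal sketch of how an Equal Error Rate can be estimated from verification scores, and what the reported drop amounts to in relative terms. The function names `equal_error_rate` and `relative_reduction` and the toy scores are illustrative assumptions, not taken from the paper; only the 6.51% and 4.51% figures come from the source.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # The EER is the operating point where the two error rates meet.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Fraction of the baseline error that the improved system removes."""
    return (baseline - improved) / baseline


# Toy scores (hypothetical): higher means "more likely the same speaker".
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])

# Relative EER reduction for the figures reported in the paper.
reported_gain = relative_reduction(6.51, 4.51)
```

With the reported numbers, the absolute drop of 2.00 percentage points corresponds to a relative EER reduction of roughly 30.7%.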
Abstract
In this paper, we focus on improving the performance of the text-dependent speaker verification system in the scenario of limited training data. The deep learning based text-dependent speaker verification system generally needs a large-scale text-dependent training data set, which can be labor-intensive and costly to collect, especially for customized new wake-up words. In recent studies, voice conversion systems that can generate high-quality synthesized speech of seen and unseen speakers have been proposed. Inspired by those works, we adopt two different voice conversion methods as well as the very simple re-sampling approach to generate new text-dependent speech samples for data augmentation purposes. Experimental results show that the proposed method significantly improves the Equal Error Rate performance from 6.51% to 4.51% in the scenario of limited training data.
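The re-sampling approach mentioned in the abstract can be illustrated with a rough sketch. The abstract does not give the re-sampling factors or how the resulting audio is labeled, so the helper name `resample_linear`, the use of plain linear interpolation, and the factors 0.9 and 1.1 are all assumptions made for illustration. The idea shown is that playing a waveform back faster or slower changes both its duration and its perceived pitch, which yields an acoustically different copy of the same spoken text.

```python
def resample_linear(samples, speed):
    """Return `samples` played back `speed` times faster (linear interpolation).

    Assumption: a simple stand-in for a proper band-limited resampler.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / speed)))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * speed              # fractional read position in the input
        lo = min(int(pos), last)     # left neighbour, clamped to the input
        hi = min(lo + 1, last)       # right neighbour, clamped to the input
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


# Hypothetical waveform; real input would be PCM samples of a wake-up word.
utterance = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# Assumed perturbation factors: one slower copy and one faster copy.
augmented = {speed: resample_linear(utterance, speed) for speed in (0.9, 1.1)}
```

A production pipeline would more likely use a dedicated audio resampler, since linear interpolation introduces aliasing; the sketch only shows the length and pitch change that makes re-sampling usable as augmentation.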
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
