Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition
Kohei Matsuura, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

TL;DR
This paper introduces a CycleGAN-based voice conversion method to adapt training speech data to test speakers, significantly improving ASR performance on low-resource endangered language corpora.
Contribution
It presents a novel speaker adaptation technique using non-parallel voice conversion to enhance ASR accuracy for endangered languages with limited data.
Findings
35-60% relative reduction in phone error rate on the Ainu corpus
40% relative reduction in phone error rate on the Mboshi corpus
Outperforms conventional unsupervised and multilingual training methods
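The findings above are stated as relative reductions in phone error rate (PER). As a minimal sketch of how such a figure is usually derived, the helper below divides the absolute drop by the baseline PER. The function name and the example numbers are hypothetical illustrations, not values reported in the paper.

```python
def relative_reduction(baseline_per: float, adapted_per: float) -> float:
    """Relative PER reduction: the absolute drop as a fraction of the baseline."""
    if baseline_per <= 0:
        raise ValueError("baseline PER must be positive")
    return (baseline_per - adapted_per) / baseline_per


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline PER of 40.0% that falls to 24.0% after adaptation.
reduction = relative_reduction(40.0, 24.0)
print(f"{reduction:.0%} relative reduction")
```

A relative figure depends on the baseline, so the same percentage can correspond to very different absolute improvements on different corpora.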
Abstract
Transcribing and archiving speech data of endangered languages is important for preserving the heritage of verbal culture, and automatic speech recognition (ASR) is a powerful tool to facilitate this process. However, since endangered languages generally do not have large corpora with many speakers, ASR models trained on them generally perform poorly. Nevertheless, we are often left with many recordings of spontaneous speech that have to be transcribed. In this work, to mitigate this speaker sparsity problem, we propose to convert the whole training speech data so that it sounds like the test speaker, in order to develop a highly accurate ASR system for this speaker. For this purpose, we utilize a CycleGAN-based non-parallel voice conversion technology to forge labeled training data that is close to the test speaker's speech. We evaluated this…
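The idea in the abstract can be sketched in a few lines: a converter maps training-speaker features toward the test speaker while the transcripts stay untouched, and a cycle-consistency term checks that a round trip through both converters recovers the input. This is only an illustration under stated assumptions. The affine `to_test` and `to_train` functions are toy stand-ins for trained CycleGAN generators, the adversarial loss is omitted, the transcript string is a placeholder, and none of the names come from the paper.

```python
from typing import Callable, List, Tuple

Frame = List[float]                    # one acoustic feature vector
Utterance = Tuple[List[Frame], str]    # (feature frames, transcript)
Converter = Callable[[Frame], Frame]


def l1(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two frames of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(frames: List[Frame], g: Converter, f: Converter) -> float:
    """Mean L1 distance between each frame and its round trip f(g(frame)).

    In CycleGAN training this term is applied in both directions and
    combined with adversarial losses, which are not shown here.
    """
    return sum(l1(f(g(fr)), fr) for fr in frames) / len(frames)


def adapt_training_set(data: List[Utterance], to_test_speaker: Converter) -> List[Utterance]:
    """Convert every frame toward the test speaker; labels are kept as-is."""
    return [([to_test_speaker(fr) for fr in frames], text) for frames, text in data]


# Toy stand-ins for trained generators (hypothetical; exact inverses of each other).
def to_test(fr: Frame) -> Frame:
    return [2.0 * v + 1.0 for v in fr]


def to_train(fr: Frame) -> Frame:
    return [(v - 1.0) / 2.0 for v in fr]


train = [([[0.0, 1.0], [2.0, 3.0]], "placeholder transcript")]
adapted = adapt_training_set(train, to_test)
```

Because the labels are carried over unchanged, the adapted set can be used for ordinary supervised ASR training. The quality of the result then rests on how well the converter preserves phonetic content.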
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
