Improving child speech recognition with augmented child-like speech
Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg

TL;DR
This paper enhances child speech recognition by using cross-lingual voice conversion to augment training data, significantly reducing word error rates in ASR models for child speech.
Contribution
It introduces a novel cross-lingual child-to-child voice conversion approach to augment data for improving child speech recognition performance.
Findings
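Cross-lingual (Dutch-to-German) child-to-child VC significantly improves child ASR performance.
Two-fold data augmentation gives the best results for the fine-tuned Conformer and Whisper models, reducing WER by ~3% absolute compared to the baseline.
Six-fold augmentation gives the best results for the model trained from scratch, reducing WER by 3.6% absolute.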
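The WER figures above are absolute reductions, which differ from relative reductions. The following is a minimal sketch of how word error rate is commonly computed (word-level edit distance divided by reference length) and how the two kinds of reduction differ. The function names are hypothetical, and the baseline value of 40.0 in the usage note is an illustrative assumption, not a number from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs (e.g. percentage points)."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that was removed."""
    return (baseline_wer - new_wer) / baseline_wer
```

For example, with an assumed baseline WER of 40.0%, a 3-point absolute reduction (to 37.0%) corresponds to a relative reduction of only 7.5%, which is why the distinction matters when comparing reported results.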
Abstract
State-of-the-art ASR systems show suboptimal performance for child speech. The scarcity of child speech data limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and from additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer and FT-Whisper models, which reduced WERs by ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by an absolute 3.6% WER. Moreover, using a small amount of…
Taxonomy
Topics: Speech and Audio Processing · Infant Health and Development · Speech Recognition and Synthesis
