Improving child speech recognition with augmented child-like speech

Yuanyuan Zhang; Zhengjun Yue; Tanvina Patel; Odette Scharenborg

arXiv:2406.10284 · cs.CL · January 9, 2025

Open Access

TL;DR

This paper augments scarce child speech training data with cross-lingual (Dutch-to-German) child-to-child voice conversion, which reduces the word error rate (WER) of ASR models on child speech by roughly 3 to 3.6% absolute.

Contribution

It studies child-to-child voice conversion as data augmentation for child speech recognition: monolingual conversion to existing child speakers in the dataset, and cross-lingual (Dutch-to-German) conversion to additional new child speakers.

Findings

01

Cross-lingual child-to-child VC significantly improves child ASR performance.

02

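Two-fold augmentation gives the best results for the fine-tuned Conformer and Whisper models, reducing WER by ~3% absolute compared to the baseline.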

03

Six-fold augmentation gives the best result for the model trained from scratch, reducing WER by 3.6% absolute.

Abstract

State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer and FT-Whisper models, which reduced WER by ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by 3.6% absolute WER. Moreover, using a small amount of…
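The abstract reports its gains as absolute WER reductions. As a rough illustration of what that means, the sketch below computes WER as the word-level edit distance divided by the number of reference words, then contrasts an absolute reduction with a relative one. The function names, the example sentences, and the 40% and 37% figures are made up for illustration and do not come from the paper; the paper's own scoring setup (text normalization, toolkit) is not described here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical transcripts: one substitution (sat -> sit) and one deletion (the).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical WERs, NOT results from the paper: 40% baseline, 37% after augmentation.
abs_gain = absolute_reduction(0.40, 0.37)
rel_gain = relative_reduction(0.40, 0.37)
print(f"WER={wer:.3f}, absolute gain={abs_gain:.3f}, relative gain={rel_gain:.3f}")
```

With these illustrative numbers, a 3-point absolute reduction corresponds to a 7.5% relative reduction, which is why it matters that the abstract states its figures as absolute.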

Taxonomy

Topics: Speech and Audio Processing · Infant Health and Development · Speech Recognition and Synthesis