A Wav2vec2-Based Experimental Study on Self-Supervised Learning Methods to Improve Child Speech Recognition
Rishabh Jain, Andrei Barcovschi, Mariam Yiwere, Dan Bigioi, Peter Corcoran, and Horia Cucu

TL;DR
This study investigates how self-supervised learning with wav2vec2 can enhance child speech recognition, achieving significant improvements with limited child speech data compared to previous methods.
Contribution
It demonstrates effective use of wav2vec2 with various training data configurations to improve child speech recognition accuracy with minimal data.
Findings
Achieved a best WER of 7.42 on the MyST dataset
Outperformed state-of-the-art wav2vec2 BASE 960 with only 10 hours of child speech data
Analyzed effects of different training data types on model performance
Abstract
Despite recent advancements in deep learning technologies, Child Speech Recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated data for training, which is scarce. In this work, we explore using the ASR model, wav2vec2, with different pretraining and finetuning configurations for self-supervised learning (SSL) toward improving automatic child speech recognition. The pretrained wav2vec2 models were finetuned using different amounts of child speech training data, adult speech data, and a combination of both, to discover the optimum amount of data required to finetune the model for the task of child ASR. Our trained model achieves a Word Error Rate (WER) of 7.42 on the MyST child speech dataset, 2.99 on the PFSTAR dataset, and 12.47 on the CMU KIDS dataset, the best among previous methods. Our models…
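Since every result above is reported as a WER, a minimal sketch of how the metric is usually computed may help: word-level edit distance between reference and hypothesis, divided by the number of reference words. This is an illustration, not the authors' evaluation code. The function name `word_error_rate` is hypothetical, the sketch assumes the reported figures are percentages, and it applies no text normalization (casing, punctuation), which real evaluations typically do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage (assumption: the paper's figures are percentages).

    WER = (substitutions + deletions + insertions) / number of reference words.
    Hypothetical helper for illustration; not the authors' scoring script.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 2))
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.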
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Balanced Selection
