A Wav2vec2-Based Experimental Study on Self-Supervised Learning Methods to Improve Child Speech Recognition

Rishabh Jain, Andrei Barcovschi, Mariam Yiwere, Dan Bigioi, Peter Corcoran, Horia Cucu

arXiv:2204.05419 · eess.AS · February 14, 2023 · 5 cites

Open Access

TL;DR

This study investigates how self-supervised learning with wav2vec2 can improve child speech recognition, and reports lower word error rates than previous methods while finetuning on limited child speech data.

Contribution

The paper demonstrates that wav2vec2, finetuned under various training data configurations, can improve child speech recognition accuracy with minimal child speech data.

Findings

01. Achieved best WER of 7.42 on the MyST dataset
02. Outperformed state-of-the-art wav2vec2 BASE 960 with only 10 hours of child speech data
03. Analyzed effects of different training data types on model performance

Abstract

Despite recent advancements in deep learning technologies, Child Speech Recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated training data, and such data is scarce for child speech. In this work, we explore the ASR model wav2vec2 with different pretraining and finetuning configurations for self-supervised learning (SSL), with the aim of improving automatic child speech recognition. The pretrained wav2vec2 models were finetuned using different amounts of child speech training data, adult speech data, and a combination of both, to discover the optimum amount of data required to finetune the model for the task of child ASR. Our trained model achieves a Word Error Rate (WER) of 7.42 on the MyST child speech dataset, 2.99 on the PFSTAR dataset, and 12.47 on the CMU KIDS dataset, the best results compared with any previous method. Our models…
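Because every result above is reported as Word Error Rate, a minimal sketch of how WER is conventionally computed may help: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration; the paper's own scoring tool and text normalization are not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

The function returns a fraction; WER is commonly quoted after multiplying by 100. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.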

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Balanced Selection