Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models
Ruchao Fan, Natarajan Balaji Shankar, and Abeer Alwan

TL;DR
This paper presents a comprehensive benchmark for child speech recognition using various speech foundation models, compares finetuning strategies, and introduces a new regularization method to improve stability.
Contribution
The paper systematically evaluates speech foundation models (SFMs) on child ASR, compares finetuning techniques, and proposes a PIF loss to make finetuning more stable.
Findings
PEFT matches full finetuning for large models but underperforms it for small models
The behavior of data augmentation and PEFT methods changes as model size increases
The proposed PIF loss improves finetuning stability
Abstract
Speech foundation models (SFMs) have achieved state-of-the-art results for various speech tasks in supervised (e.g. Whisper) or self-supervised systems (e.g. WavLM). However, the performance of SFMs for child ASR has not been systematically studied. In addition, there is no benchmark for child ASR with standard evaluations, making comparisons of novel ideas difficult. In this paper, we initiate and present a comprehensive benchmark on several child speech databases based on various SFMs (Whisper, Wav2vec2.0, HuBERT, and WavLM). Moreover, we investigate finetuning strategies by comparing various data augmentation and parameter-efficient finetuning (PEFT) methods. We observe that the behaviors of these methods are different when the model size increases. For example, PEFT matches the performance of full finetuning for large models but is worse for small models. To stabilize finetuning…
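To make the PEFT idea in the abstract concrete, the following is a minimal sketch of one widely used PEFT technique, a LoRA-style low-rank adapter on a single linear layer. The abstract does not say which PEFT methods the paper compares, so LoRA is an assumption chosen for illustration, and the names `LoRALinear`, `matvec`, `rank`, and `alpha` are hypothetical. The frozen weight is a random stand-in, not a real SFM checkpoint.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Illustrative only: not the paper's implementation.
    """

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight (random stand-in for a real layer).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors. B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=64, d_out=64, rank=4)
x = [1.0] * 64
y = layer.forward(x)
print(layer.trainable_params(), "trainable vs", layer.frozen_params(), "frozen")
```

In this toy layer only the two small factors would be updated during finetuning, which is the sense in which such methods are parameter-efficient; full finetuning would instead update every entry of the frozen weight.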
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Phonetics and Phonology Research
