Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models

Ruchao Fan, Natarajan Balaji Shankar, and Abeer Alwan

arXiv:2406.10507 · eess.AS · June 18, 2024

TL;DR

This paper presents a comprehensive benchmark for child speech recognition using various speech foundation models, compares finetuning strategies, and introduces a new regularization method to improve stability.

Contribution

The paper systematically evaluates speech foundation models (SFMs) on child ASR, compares finetuning techniques, and proposes the PIF loss for more stable finetuning.

Findings

01 PEFT matches full finetuning for large models

02 Model behavior varies with size and finetuning method

03 Proposed PIF loss improves finetuning stability

Abstract

Speech foundation models (SFMs) have achieved state-of-the-art results for various speech tasks in supervised (e.g. Whisper) or self-supervised systems (e.g. WavLM). However, the performance of SFMs for child ASR has not been systematically studied. In addition, there is no benchmark for child ASR with standard evaluations, which makes it difficult to compare novel ideas. In this paper, we initiate and present a comprehensive benchmark on several child speech databases based on various SFMs (Whisper, Wav2vec2.0, HuBERT, and WavLM). Moreover, we investigate finetuning strategies by comparing various data augmentation and parameter-efficient finetuning (PEFT) methods. We observe that these methods behave differently as the model size increases. For example, PEFT matches the performance of full finetuning for large models but performs worse for small models. To stabilize finetuning…
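To make the contrast between full finetuning and PEFT concrete, the sketch below counts trainable parameters for a single square projection layer under each regime. It assumes a low-rank adapter (LoRA-style) as the PEFT method, which this excerpt does not confirm is among the methods the paper compares; the function names, the layer widths (768 and 1280), and the rank (8) are hypothetical values chosen only for illustration.

```python
# Illustrative arithmetic only: a low-rank adapter is assumed as the PEFT method,
# and every width and rank below is a hypothetical value, not a figure from the paper.

def linear_params(d_in, d_out, bias=True):
    """Trainable parameters when a dense layer is fully finetuned."""
    return d_in * d_out + (d_out if bias else 0)


def lora_params(d_in, d_out, rank):
    """Trainable parameters of a low-rank adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def trainable_fraction(d_model, rank):
    """Adapter parameters as a fraction of a full d_model x d_model weight matrix."""
    return lora_params(d_model, d_model, rank) / linear_params(d_model, d_model, bias=False)


if __name__ == "__main__":
    for d_model in (768, 1280):  # hypothetical "small" and "large" layer widths
        frac = trainable_fraction(d_model, rank=8)
        print(f"d_model={d_model}: adapter trains {frac:.2%} of the layer's weights")
```

For a fixed rank, the adapter's share of a square layer is 2 * rank / d_model, so it shrinks as the layer gets wider. This only describes the parameter budget; it does not by itself explain the accuracy differences the abstract reports.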

Code & Models

Repositories

Diamondfan/SPAPL_KidsASR
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Phonetics and Phonology Research