Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults
Ahmed Adel Attia, Jing Liu, Wei Ai, Dorottya Demszky, Carol Espy-Wilson

TL;DR
This paper improves children's speech recognition in ASR systems by enhancing data preprocessing and demonstrating significant WER reductions with Whisper models, addressing the performance gap between children and adults.
Contribution
It introduces more efficient data preprocessing techniques for the MyST dataset and demonstrates substantial WER improvements in children's speech recognition using Whisper models.
Findings
WER on the MyST test set reduced from 13.93% to 9.11% with Whisper-Small
WER on the MyST test set reduced from 13.23% to 8.61% with Whisper-Medium
Improvement generalizes to unseen datasets
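For readers unfamiliar with the metric, the sketch below shows one common way to compute Word Error Rate (word-level edit distance divided by the number of reference words) and the relative reduction implied by the reported figures. It is a minimal illustration, not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, and it assumes plain whitespace tokenization with no text normalization, which may differ from the paper's scoring setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: whitespace tokenization, no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error that was removed."""
    return (before - after) / before


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

    # WER figures reported for the MyST test set.
    print(f"Whisper-Small:  {relative_reduction(13.93, 9.11):.1%}")
    print(f"Whisper-Medium: {relative_reduction(13.23, 8.61):.1%}")
```

Under these assumptions, both reported improvements correspond to a relative WER reduction of roughly 35%.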
Abstract
Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech. They were able to demonstrate some improvement on a limited testset. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST testset from 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium and show that this…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
