ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
Jiaming Zhou, Shiyao Wang, Shiwan Zhao, Jiabei He, Haoqin Sun, Hui Wang, Cheng Liu, Aobo Kong, Yujie Guo, Xi Yang, Yequan Wang, Yonghua Lin, and Yong Qin

TL;DR
This paper introduces ChildMandarin, a comprehensive Mandarin speech dataset for children aged 3-5, enabling improved ASR and speaker verification research for young children's speech in Mandarin.
Contribution
The paper presents a new Mandarin speech dataset for children aged 3 to 5 (41.25 hours of manually transcribed speech from 397 speakers), together with an analysis of the data and an evaluation of ASR and speaker verification (SV) models, addressing a critical resource gap.
Findings
Fine-tuning pre-trained models significantly improves ASR performance.
The dataset supports effective speaker verification despite children's vocal variability.
ASR models trained from scratch show promising results on child speech.
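The ASR findings above are comparisons of recognition error. This summary does not name the metric, so as an assumption the sketch below uses character error rate (CER), a common choice for Mandarin because the text has no word boundaries. The function names `edit_distance` and `cer` are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length.

    Whitespace is dropped first (an assumption of this sketch), so spacing
    differences between transcripts are not counted as errors.
    """
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substituted character out of five.
    print(cer("我想吃苹果", "我想吃平果"))
```

Under this definition a lower CER is better, and a fine-tuned model would be judged against a from-scratch model by comparing their CERs on the same test transcripts.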
Abstract
Automatic speech recognition (ASR) systems have advanced significantly with models like Whisper, Conformer, and self-supervised frameworks such as Wav2vec 2.0 and HuBERT. However, developing robust ASR models for young children's speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. The dataset comprises 41.25 hours of speech with carefully crafted manual transcriptions, collected from 397 speakers across various provinces in China, with balanced gender representation. We provide a comprehensive analysis of speaker demographics, speech duration distribution and geographic coverage. Additionally, we evaluate ASR performance on models trained from scratch, such as Conformer, as well as fine-tuned…
Taxonomy
Topics: Speech Recognition and Synthesis
