KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening
Rohan Sharma, Dancheng Liu, Jingchen Sun, Shijie Zhou, Jiayu Qin, Jinjun Xiong, Changyou Chen

TL;DR
KidSpeak is a novel multi-task foundation model tailored for children's speech recognition and screening, integrating phonetic knowledge and a new alignment tool to improve accuracy and data quality in pediatric speech AI applications.
Contribution
The paper introduces KidSpeak, a multi-purpose speech foundation model for children, and FASA, a novel speech alignment tool, addressing limitations of existing adult-focused speech models.
Findings
Achieved 87% accuracy across four tasks.
FASA improves speech alignment quality by 13.6x.
First comprehensive solution for pediatric speech AI and therapy.
Abstract
With the rapid advancement of conversational and diffusion-based AI, there is a growing adoption of AI in educational services, ranging from grading and assessment tools to personalized learning systems that provide targeted support for students. However, this adaptability has yet to fully extend to the domain of children's speech, where existing models often fail due to their reliance on datasets designed for clear, articulate adult speech. Children, particularly those in early developmental stages or with speech and language pathologies, present unique challenges that current AI models and datasets are ill-equipped to handle. To address this, we introduce KidSpeak, a multi-task speech-enhanced Foundation Model capable of both generative and discriminative tasks specifically tailored to children's speech patterns. Our framework employs a two-stage training process that incorporates…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The paper advances children's speech analysis, an under-researched area that could use more innovation. Part of the problem is the lack of corpora, which this paper alleviates. The results show large improvements over the quoted baselines: (1) in phonetic error rate on TIMIT over Whisper, and (2) on a variety of tasks over PandaGPT. FASA is also shown to provide better alignments than human annotators.
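For readers unfamiliar with the metric this review cites, phonetic error rate (PER) is conventionally the edit distance between reference and hypothesized phone sequences, divided by the number of reference phones. The sketch below illustrates that general definition only; it is not taken from the paper, and the function names and example phone sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone labels."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(refs, hyps):
    """Corpus-level PER: total edit errors divided by total reference phones."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference set contains no phones")
    return total_errors / total_phones


# Hypothetical example: one substitution and one insertion over six reference phones.
refs = [["k", "ae", "t"], ["d", "ao", "g"]]
hyps = [["k", "ae", "p"], ["d", "ao", "g", "z"]]
per = phone_error_rate(refs, hyps)
```

Under this definition, the PER figures compared in the reviews (for example 9.7% versus 8.3%) are ratios of this kind, so they are only comparable when the phone inventory and test split match.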
In Table 2 of this paper, the Wav2Vec 2.0 number is quoted as 9.7% PER; however, the Wav2Vec 2.0 paper (https://arxiv.org/pdf/2006.11477v3) reports 8.3%, which is the leading entry on Papers with Code (https://paperswithcode.com/sota/speech-recognition-on-timit). It seems that pre-training with a latent phonetic space, as in wav2vec, plus fine-tuning is enough to yield good results. In Table 4, computing both WTA and CTA averages (word and character transcription) and then including them in the multi-task perfo
Children’s speech recognition and diagnosis is a critical yet underexplored area in the existing literature.
1. The presentation of this paper falls short of academic standards, resembling a casual blog post rather than a formal research paper. Lines 82-86 include overly flashy content that detracts from readability and lacks appropriate captions for clarity. Furthermore, the equations on lines 213-215 are unnumbered, which disrupts the structure. The formatting in lines 439-455, featuring diamond-shaped bullet points, italicized beginnings, and inline numbered items with a black background and white t
1) This paper tackles a crucial question in developing a foundational model for processing children’s speech, a relatively underexplored area. By integrating large language models (LLMs) like Whisper and Vicuna and fine-tuning Vicuna with LoRA, the study demonstrates improved performance over the baseline method. 2) A two-stage training procedure integrates phonetic information into the Whisper speech encoder, improving downstream performance.
1) A major limitation of this work is that the proposed KidSpeak model is only compared against a single baseline (PandaGPT). Numerous multimodal models, such as ImageBind [1] and NExT-GPT [2], among others, are available for comparison, along with robust single-task models (such as wav2vec 2.0, Whisper, and HuBERT) specifically designed for age classification, gender classification, automatic speech recognition, and other relevant tasks, whose results could also be included. 2) The proposed m
The strengths of the paper are: 1. The paper is well organized and written. 2. The contribution is significant in the area of child speech and language modeling, as the data quality of child speech corpora is consistently low. The proposed alignment tool can greatly boost the development of better models for child speech. 3. The paper is the first to use a speech-LLM for child speech and proposes training Whisper-MH (multi-decoder Whisper) to enhance the phonetic information in the Whisper encoder.
The weaknesses of the paper are: 1. Insufficient experiments. The experiment section is far shorter than the other sections, and the current experiments do not seem sufficient to support the claims made in the paper. 2. Missing references. In the related work (Kids’ Speech), the authors state that child ASR remains under-researched. In fact, many recent papers use speech foundation models for child speech recognition, following ASR system development, e.g. [1] Ruchao Fan, Nat
Taxonomy
Topics: Language Development and Disorders · Speech Recognition and Synthesis · Face recognition and analysis
