Can large audio language models understand child stuttering speech? speech summarization, and source separation
Chibuzor Okocha, Maya Bakri, Christan Grant

TL;DR
This paper evaluates whether large audio-language models can understand and summarize disfluent child speech, covering single-channel source separation and clinically relevant child-only summarization, and identifies the conditions under which child-only summaries are reliable.
Contribution
It systematically assesses state-of-the-art LALMs on child speech tasks, providing insights and practical guidance for clinical and educational applications.
Findings
LALMs can produce faithful child-only summaries under certain conditions
Model-human agreement varies depending on speech disfluency and task complexity
Guidelines and tools are provided for reliable child speech processing
Abstract
Child speech differs from adult speech in acoustics, prosody, and language development, and disfluencies (repetitions, prolongations, blocks) further challenge Automatic Speech Recognition (ASR) and downstream Natural Language Processing (NLP). Recent large audio-language models (LALMs) demonstrate strong cross-modal audio understanding; however, their behavior on disfluent child speech remains underexplored. We evaluate several state-of-the-art LALMs in two settings: an interview (mixed speakers) and a reading task (single child). The tasks are (i) single-channel source separation to isolate the child and (ii) child-only summarization that preserves clinically relevant disfluencies and avoids adult-speech leakage. Evaluation combines a Large Language Model (LLM) as a judge, human expert ratings, and BERTScore (F1), and we report agreement between models and between models and humans to…
