Can large audio language models understand child stuttering speech? speech summarization, and source separation

Chibuzor Okocha; Maya Bakri; Christan Grant

arXiv:2510.20850·eess.AS·October 27, 2025

TL;DR

This paper evaluates whether large audio-language models (LALMs) can understand and summarize disfluent child speech. It covers single-channel source separation and clinically relevant child-only summarization, and it identifies the conditions under which child-only summaries are reliable.

Contribution

The paper systematically assesses several state-of-the-art LALMs on two child-speech tasks, source separation and child-only summarization, and offers practical guidance for clinical and educational applications.

Findings

1. LALMs can produce faithful child-only summaries under certain conditions.
2. Model-human agreement varies with speech disfluency and task complexity.
3. Guidelines and tools are provided for reliable child speech processing.

Abstract

Child speech differs from adult speech in acoustics, prosody, and language development, and disfluencies (repetitions, prolongations, blocks) further challenge Automatic Speech Recognition (ASR) and downstream Natural Language Processing (NLP). Recent large audio-language models (LALMs) demonstrate strong cross-modal audio understanding; however, their behavior on disfluent child speech remains underexplored. We evaluate several state-of-the-art LALMs in two settings: an interview (mixed speakers) and a reading task (single child). The tasks are (i) single-channel source separation to isolate the child and (ii) child-only summarization that preserves clinically relevant disfluencies and avoids adult-speech leakage. Evaluation combines a Large Language Model (LLM) as a judge, human expert ratings, and BERTScore (F1), and we report agreement between models and between models and humans to…
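
The abstract names BERTScore (F1) as one of the evaluation signals. As a rough illustration of how that score is formed, the sketch below applies the standard greedy-matching definition (precision and recall from maximum cosine similarities, combined as a harmonic mean) to hand-made toy vectors. The vectors, the function names, and the two-dimensional "embeddings" are all assumptions for illustration; the paper's actual setup would use contextual embeddings from a pretrained model, and nothing here reproduces its pipeline or results.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(candidate_vecs, reference_vecs):
    """Greedy-matching F1 over token vectors (hypothetical helper).

    Precision: each candidate token is matched to its most similar
    reference token. Recall: each reference token is matched to its
    most similar candidate token. F1 is their harmonic mean.
    """
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy vectors standing in for token embeddings (assumption, not real model output).
reference_summary = [[1.0, 0.0], [0.0, 1.0]]  # two reference tokens
model_summary = [[1.0, 0.0]]                  # one candidate token

score = bertscore_f1(model_summary, reference_summary)
```

In this toy case the candidate covers only one of the two reference tokens, so precision is perfect while recall is halved, and the F1 lands between them. A summary that drops a clinically relevant disfluency would lose recall in the same way.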
