Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection
Shangkun Huang, Jing Deng, Jintao Kang, Rong Zheng

TL;DR
This paper introduces an LLM-driven multi-task framework that jointly optimizes automatic speech recognition (ASR) and stuttering event detection (SED), reducing the character error rate (CER) to 5.45% and raising the SED F1-score to 73.63% on the AS-70 Mandarin stuttering dataset.
Contribution
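It presents a unified architecture in which the ASR and SED branches interact dynamically: CTC-generated soft prompts from the ASR branch assist LLM context modeling, and stutter embeddings from the SED branch help the LLM interpret stuttered speech. Contrastive learning strengthens the discriminative power of stuttering acoustic features, and Focal Loss mitigates the long-tailed distribution of stuttering event categories.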
Findings
Reduced the ASR character error rate (CER) to 5.45%, a 37.71% relative reduction.
Achieved a stuttering event detection (SED) F1-score of 73.63%, a 46.58% relative increase.
Both results are reported on the AS-70 Mandarin stuttering dataset.
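The relative figures above can be read back into approximate baseline values. The sketch below is only arithmetic on the reported numbers: it assumes each relative change is measured against a single baseline using the usual definition (change divided by baseline), which this summary does not spell out. The function names are illustrative, and the resulting baselines (roughly 8.75% CER and 50.2% F1) are back-calculated here, not stated in the source.

```python
def implied_baseline_lower_is_better(new_value, relative_reduction):
    """Baseline b such that (b - new_value) / b == relative_reduction."""
    return new_value / (1.0 - relative_reduction)


def implied_baseline_higher_is_better(new_value, relative_increase):
    """Baseline b such that (new_value - b) / b == relative_increase."""
    return new_value / (1.0 + relative_increase)


# Reported values from the Findings list (percentages).
cer_baseline = implied_baseline_lower_is_better(5.45, 0.3771)
f1_baseline = implied_baseline_higher_is_better(73.63, 0.4658)

print(f"Implied baseline CER: {cer_baseline:.2f}%")
print(f"Implied baseline SED F1: {f1_baseline:.2f}%")
```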
Abstract
The performance bottleneck of Automatic Speech Recognition (ASR) in stuttering speech scenarios has limited its applicability in domains such as speech rehabilitation. This paper proposed an LLM-driven ASR-SED multi-task learning framework that jointly optimized the ASR and Stuttering Event Detection (SED) tasks. We proposed a dynamic interaction mechanism where the ASR branch leveraged CTC-generated soft prompts to assist LLM context modeling, while the SED branch output stutter embeddings to enhance LLM comprehension of stuttered speech. We incorporated contrastive learning to strengthen the discriminative power of stuttering acoustic features and applied Focal Loss to mitigate the long-tailed distribution in stuttering event categories. Evaluations on the AS-70 Mandarin stuttering dataset demonstrated that our framework reduced the ASR character error rate (CER) to 5.45% (-37.71%…
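To illustrate why Focal Loss helps with a long-tailed label distribution, here is a minimal sketch of the standard multi-class form, FL = -(1 - p_t)^gamma * log(p_t). It is not the paper's implementation: the abstract does not give the exact formulation, the value of gamma, or whether the SED task is multi-class or multi-label, so `gamma=2.0` and the two-class example probabilities are assumptions chosen for illustration.

```python
import math


def cross_entropy(probs, target):
    """Standard cross-entropy for one example: -log(p_t)."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)**gamma * log(p_t).

    probs  : predicted class probabilities (assumed to sum to 1)
    target : index of the true class
    gamma  : focusing parameter (assumed value; gamma=0 recovers cross-entropy)
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical predictions for two examples whose true class is index 0.
easy = [0.9, 0.1]  # confidently correct, typical of frequent classes
hard = [0.1, 0.9]  # confidently wrong, typical of rare classes

for name, probs in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(probs, 0)
    fl = focal_loss(probs, 0)
    print(f"{name}: cross-entropy={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.2f}")
```

With these assumed numbers the easy example keeps about 1% of its cross-entropy loss and the hard example about 81%, so training signal shifts toward examples the model gets wrong, which rare event categories tend to be.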
Taxonomy
Topics: Stuttering Research and Treatment · Speech Recognition and Synthesis · Speech and dialogue systems
