Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection

Chenxu Guo; Jiachen Lian; Xuanru Zhou; Jinming Zhang; Shuhe Li; Zongli Ye; Hwi Joo Park; Anaisha Das; Zoe Ezzes; Jet Vonk; Brittany Morin; Rian Bogley; Lisa Wauters; Zachary Miller; Maria Gorno-Tempini; Gopala Anumanchipalli

arXiv:2505.16351·eess.AS·May 27, 2025

TL;DR

This paper presents Dysfluent-WFST, a zero-shot decoder that jointly transcribes phonemes and detects dysfluency on top of existing speech encoders. It requires no additional training and is intended to support clinical assessment of disordered speech.

Contribution

Introduces Dysfluent-WFST, a novel zero-shot decoder that transcribes phonemes and detects dysfluency using existing encoders, outperforming previous methods without extra training.

Findings

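01. Achieves state-of-the-art phonetic error rate and dysfluency detection on simulated and real speech data.
02. Operates with existing upstream encoders such as WavLM, with no additional training.
03. Lightweight and interpretable: pronunciation behavior is modeled explicitly in decoding rather than through a more complex architecture.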

Abstract

Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is…
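
To make the idea of joint transcription and detection concrete, here is a minimal Python sketch. It is not the paper's WFST decoder: a plain edit-distance alignment stands in for it, and every name (`align_phonemes`, the label strings, the ARPAbet-style example phonemes) is a hypothetical choice for illustration. It aligns a reference phoneme sequence with a spoken one, labels each position, and reports an error ratio in the spirit of a phonetic error rate.

```python
from typing import List, Tuple

Alignment = List[Tuple[str, str, str]]  # (label, reference phoneme, spoken phoneme)


def align_phonemes(ref: List[str], hyp: List[str]) -> Tuple[Alignment, float]:
    """Edit-distance alignment of reference vs. spoken phonemes.

    Illustrative stand-in only, not the Dysfluent-WFST algorithm.
    Returns the labeled alignment and errors / len(ref).
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits aligning ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the end, preferring the diagonal (match/substitution).
    out: Alignment = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            label = "match" if ref[i - 1] == hyp[j - 1] else "substitution"
            out.append((label, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            # Extra spoken phoneme: call it a repetition when it duplicates
            # a neighbouring spoken phoneme, otherwise a plain insertion.
            phone = hyp[j - 1]
            repeated = (j >= 2 and hyp[j - 2] == phone) or (j < m and hyp[j] == phone)
            out.append(("repetition" if repeated else "insertion", "-", phone))
            j -= 1
        else:
            # Reference phoneme that was never spoken.
            out.append(("deletion", ref[i - 1], "-"))
            i -= 1
    out.reverse()
    error_rate = cost[n][m] / n if n else 0.0
    return out, error_rate


if __name__ == "__main__":
    # "please" spoken with a repeated initial /p/
    alignment, rate = align_phonemes(["p", "l", "iy", "z"], ["p", "p", "l", "iy", "z"])
    for row in alignment:
        print(row)
    print("error rate:", rate)
```

A real decoder would work from frame-level encoder outputs rather than clean phoneme strings; the sketch only shows why aligning against a reference text yields both a transcription and per-phoneme dysfluency labels at once.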

Code & Models

Repositories

berkeley-speech-group/dysfluentwfst (PyTorch, official)

Taxonomy

Topics: Stuttering Research and Treatment · Voice and Speech Disorders · Speech Recognition and Synthesis