StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction

Qianheng Xu

arXiv:2510.18938 · eess.AS · November 6, 2025

TL;DR

This paper introduces two end-to-end models, StutterZero and StutterFormer, that convert stuttered speech into fluent speech while jointly transcribing it, achieving lower word error rate (WER) and higher semantic similarity (BERTScore) than the Whisper-Medium baseline.

Contribution

The work presents the first end-to-end waveform-to-waveform models for simultaneous stutter correction and transcription, trained on synthesized paired data and evaluated on unseen speakers.

Findings

01. StutterZero reduces WER by 24% and improves BERTScore by 31%.

02. StutterFormer achieves a 28% decrease in WER and a 34% BERTScore improvement.

03. Both models outperform the leading Whisper-Medium model on benchmarks.
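
For readers unfamiliar with the metric behind these findings, the following is a minimal sketch of how WER is conventionally computed: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the paper's evaluation code. The function names, the lowercase-and-split normalization, and the example sentences and scores are all assumptions made for illustration, and this page does not state whether the reported percentages are relative or absolute changes.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting; a real evaluation may normalize differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, model: float) -> float:
    """Fraction by which `model` lowers an error rate relative to `baseline`."""
    return (baseline - model) / baseline


# Hypothetical example: a transcript that keeps the stuttered repetitions
# versus one that matches the intended fluent sentence.
intended = "i want to go home"
with_repetitions = "i i i want to to go home"
fluent = "i want to go home"

wer_with_repetitions = word_error_rate(intended, with_repetitions)
wer_fluent = word_error_rate(intended, fluent)

# Made-up error rates, chosen only to show the arithmetic of a relative change.
example_change = relative_reduction(baseline=0.20, model=0.15)
```

Here the three repeated words count as insertions against a five-word reference, so the transcript with repetitions scores worse than the fluent one. Under the relative reading, a drop from 0.20 to 0.15 would be described as a 25% reduction.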

Abstract

Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the…

Taxonomy

Topics: Stuttering Research and Treatment · Speech Recognition and Synthesis · Phonetics and Phonology Research