End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation

Minghui Wu; Haitao Tang; Jiahuan Fan; Ruizhi Liao; Yanyong Zhang

arXiv:2603.01382·cs.SD·March 3, 2026

Open Access

TL;DR

This paper introduces an end-to-end simultaneous dysarthric speech reconstruction (DSR) system that uses a frame-level adaptor and multiple wait-k knowledge distillation to reduce latency and improve robustness, reporting a 54.25% lower WER than the previous state of the art.

Contribution

The study proposes a novel end-to-end DSR system with a frame-level adaptor and multiple wait-k TTS, enhancing robustness and prosody in real-time dysarthric speech reconstruction.

Findings

01 Average response time of 1.03 seconds

02 Achieved MOS of 4.67 on UASpeech

03 Reduced WER by 54.25% compared to state-of-the-art

Abstract

Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly, leading to excessively long response times in such systems, rendering them impractical in long-speech scenarios. Cascaded DSR systems based on streaming ASR and incremental TTS can help reduce latency. However, patients with differing dysarthria severity exhibit substantial pronunciation variability for the same text, resulting in poor robustness of ASR and limiting the intelligibility of reconstructed speech. In addition, incremental TTS suffers from poor prosodic feature prediction due to a limited receptive field. In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A…
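
To make the "limited receptive field" point concrete, here is a minimal sketch of the generic wait-k schedule from the simultaneous-generation literature. It is an illustration under stated assumptions, not the paper's implementation: the abstract is truncated before the method is described, so the function names (`wait_k_visible`, `wait_k_schedule`) and the choice to count abstract "source units" and "output steps" are hypothetical. Under wait-k, output step t (1-indexed) is produced once t + k - 1 source units have been read, capped at the source length.

```python
from typing import List


def wait_k_visible(t: int, k: int, n_source: int) -> int:
    """Number of source units visible when emitting output step t (1-indexed).

    Generic wait-k rule: read k units first, then alternate one read per write,
    never reading past the end of the source.
    """
    if t < 1 or k < 1 or n_source < 0:
        raise ValueError("t and k must be >= 1 and n_source must be >= 0")
    return min(t + k - 1, n_source)


def wait_k_schedule(n_source: int, n_target: int, k: int) -> List[int]:
    """Visible-source count for every output step from 1 to n_target."""
    return [wait_k_visible(t, k, n_source) for t in range(1, n_target + 1)]


if __name__ == "__main__":
    # Hypothetical sizes: 6 source units and 6 output steps.
    # A larger k gives each step more lookahead at the cost of a later start.
    for k in (1, 3, 5):
        print(f"k={k}: {wait_k_schedule(n_source=6, n_target=6, k=k)}")
```

A small k starts output early but gives each step little future context, which is one way to read the prosody limitation the abstract attributes to incremental TTS; a large k approaches sentence-level behavior and its latency.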

Taxonomy

Topics: Voice and Speech Disorders · Speech Recognition and Synthesis · Phonocardiography and Auscultation Techniques