DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement
Minghui Wu, Xueling Liu, Jiahuan Fan, Haitao Tang, Yanyong Zhang, Yue Zhang

TL;DR
DARS is a dysarthria-aware speech synthesis framework that models pathological rhythm and acoustic style. Used to augment ASR training data, it reduces word error rate on dysarthric speech by 54.22%.
Contribution
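DARS introduces a multi-stage rhythm predictor, optimized with contrastive preferences between normal and dysarthric speech, and a dysarthric-style conditional flow matching mechanism. Both are built on Matcha-TTS and designed for dysarthric speech augmentation.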
Findings
Achieves a Mean Cepstral Distortion of 4.29, closely matching real dysarthric speech.
Reduces WER by 54.22% when used for ASR data augmentation.
Demonstrates effectiveness on the TORGO dataset.
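The two metrics behind these findings can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes that MCD refers to the standard mel-cepstral distortion formula (the summary does not spell it out), that the two feature sequences are already time-aligned, and that the 0th (energy) coefficient is excluded, which is a common but not universal choice. The helper names are hypothetical, and whether the 54.22% figure is a relative reduction is not stated in the summary; `relative_reduction` only shows how such a figure would be computed.

```python
import math
import numpy as np


def mel_cepstral_distortion(ref, syn):
    """Frame-averaged MCD in dB between two time-aligned (frames, dims) arrays.

    Assumption: column 0 (energy) is skipped before computing the distance.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("inputs must be time-aligned and equally shaped")
    diff = ref[:, 1:] - syn[:, 1:]
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / math.log(10.0)) * np.mean(per_frame))


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction in percent, e.g. between two WER values."""
    return 100.0 * (baseline - improved) / baseline
```

In practice the reference and synthesized frames are usually aligned with dynamic time warping before MCD is computed; that step is omitted here for brevity.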
Abstract
Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mean Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech.…
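For readers unfamiliar with conditional flow matching, the sketch below shows the generic optimal-transport CFM objective that Matcha-TTS trains its decoder with. It is not the DARS implementation: the abstract does not describe how the dysarthric style enters the model, so `cond` is a hypothetical placeholder for whatever text and style conditioning the network receives, and `vector_field` is a hypothetical stand-in for the decoder.

```python
import numpy as np


def cfm_training_pair(x1, t, x0, sigma_min=1e-4):
    """Build the interpolated sample x_t and its regression target u_t.

    x1: data sample (e.g. a mel-spectrogram), x0: Gaussian noise of the
    same shape, t: scalar in [0, 1]. Uses the optimal-transport path
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0  # time derivative of x_t
    return x_t, u_t


def cfm_loss(vector_field, x1, cond, rng, sigma_min=1e-4):
    """Single-sample Monte Carlo estimate of the CFM loss.

    vector_field(x_t, t, cond) is a hypothetical callable standing in for
    the decoder network; cond is an assumed conditioning input.
    """
    t = rng.uniform(0.0, 1.0)
    x0 = rng.standard_normal(x1.shape)
    x_t, u_t = cfm_training_pair(x1, t, x0, sigma_min)
    return float(np.mean((vector_field(x_t, t, cond) - u_t) ** 2))
```

At inference time, a model trained this way generates a sample by integrating the learned vector field from noise at t = 0 to t = 1 with an ODE solver.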
Taxonomy
Topics: Voice and Speech Disorders · Speech Recognition and Synthesis · Music and Audio Processing
