Improved Dysarthric Speech to Text Conversion via TTS Personalization
Péter Mihajlik, Éva Székely, Piroska Barta, Máté Soma Kádár, Gergely Dobsinszki, László Tóth

TL;DR
This case study shows that fine-tuning a speech-to-text model on both real and synthetic dysarthric speech markedly improves transcription accuracy for a Hungarian speaker with severe dysarthria.
Contribution
The study introduces a novel method for generating synthetic dysarthric speech with controlled severity for personalized ASR fine-tuning.
Findings
Character error rate (CER) reduced from 36-51% (zero-shot) to 7.3% after fine-tuning on real and synthetic dysarthric speech
Synthetic speech inclusion yields 18% relative CER reduction
Personalized models outperform general models like Whisper-turbo
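For readers unfamiliar with the metric, the figures above can be reproduced in spirit with a minimal sketch. It assumes the standard definition of CER (character-level Levenshtein distance divided by reference length) and the usual definition of relative reduction; the function names `edit_distance`, `cer`, and `relative_reduction` are illustrative and not taken from the paper, which may apply text normalization not shown here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Arithmetic on the reported figures only: 36-51% zero-shot vs. 7.3% fine-tuned.
    for zero_shot in (36.0, 51.0):
        print(f"{zero_shot}% -> 7.3%: {relative_reduction(zero_shot, 7.3):.1%} relative")
```

Note that the 18% relative reduction reported for adding synthetic speech is measured against a baseline fine-tuned on real data only; that baseline CER is not stated in this summary, so it is not computed here.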
Abstract
We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. State-of-the-art automatic speech recognition (ASR) models struggle with zero-shot transcription of dysarthric speech, yielding high error rates. To improve performance with limited real dysarthric data, we fine-tune an ASR model using synthetic speech generated via a personalized text-to-speech (TTS) system. We introduce a method for generating synthetic dysarthric speech with controlled severity by leveraging premorbidity recordings of the given speaker and speaker embedding interpolation, enabling ASR fine-tuning on a continuum of impairments. Fine-tuning on both real and synthetic dysarthric speech reduces the character error rate (CER) from 36-51% (zero-shot) to 7.3%. Our monolingual FastConformer_Hu ASR model significantly outperforms Whisper-turbo when…
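The idea of controlling severity through speaker embedding interpolation can be illustrated with a short sketch. It assumes plain linear interpolation between one embedding derived from the speaker's premorbid recordings and one derived from their dysarthric recordings; the abstract does not specify the interpolation scheme, and the names `interpolate_embeddings` and `severity_continuum` are hypothetical.

```python
from typing import List, Sequence


def interpolate_embeddings(premorbid: Sequence[float],
                           dysarthric: Sequence[float],
                           alpha: float) -> List[float]:
    """Linear blend: alpha=0 gives the premorbid voice, alpha=1 the dysarthric one.

    Assumption: linear interpolation; the paper's exact scheme may differ.
    """
    if len(premorbid) != len(dysarthric):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * p + alpha * d for p, d in zip(premorbid, dysarthric)]


def severity_continuum(premorbid: Sequence[float],
                       dysarthric: Sequence[float],
                       steps: int) -> List[List[float]]:
    """Evenly spaced embeddings from premorbid to dysarthric (inclusive)."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [interpolate_embeddings(premorbid, dysarthric, i / (steps - 1))
            for i in range(steps)]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real speaker embeddings have far more dimensions.
    for emb in severity_continuum([0.0, 2.0], [1.0, 0.0], steps=5):
        print(emb)
```

In such a setup, each interpolated embedding would condition the personalized TTS system to synthesize speech at a different severity level, and the resulting audio would be paired with its input text for ASR fine-tuning.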
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research
