Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis

Wing-Zin Leung; Mattias Cross; Anton Ragni; Stefan Goetze

arXiv:2406.08568·cs.SD·June 14, 2024

TL;DR

This paper presents a novel data augmentation approach using diffusion-based text-to-dysarthric-speech synthesis to improve dysarthric speech recognition, addressing data scarcity and variability issues.

Contribution

It introduces a diffusion-based TTDS method for augmenting training data, enhancing the performance of large ASR models on dysarthric speech recognition tasks.

Findings

01  Improved synthesis quality metrics
02  Enhanced ASR accuracy on dysarthric speech
03  Outperforms existing DASR baselines

Abstract

Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. However, progress in dysarthric ASR (DASR) has been limited by high variability in dysarthric speech and limited public availability of dysarthric training data. This paper demonstrates that data augmentation using text-to-dysarthric-speech (TTDS) synthesis for fine-tuning large ASR models is effective for DASR. Specifically, diffusion-based text-to-speech (TTS) models can produce speech samples similar to dysarthric speech that can be used as additional training data for fine-tuning ASR foundation models, in this case Whisper. Results show improved synthesis metrics and ASR performance for the proposed multi-speaker diffusion-based…
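
The augmentation idea in the abstract, adding synthesized dysarthric-like utterances to a small set of real recordings before fine-tuning, can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Utterance` record, the `build_augmented_set` helper, the `synth_ratio` parameter, and the file paths are all hypothetical names chosen for illustration, and the sketch only builds a mixed training list without running any TTS or ASR model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    audio_path: str   # hypothetical path to a waveform file
    text: str         # transcript used as the ASR training target
    synthetic: bool   # True if the audio came from a TTS model


def build_augmented_set(real, synthetic, synth_ratio=1.0, seed=0):
    """Mix all real utterances with a random subset of synthetic ones.

    At most synth_ratio * len(real) synthetic utterances are added
    (an assumed mixing rule, not taken from the paper).
    """
    if synth_ratio < 0:
        raise ValueError("synth_ratio must be non-negative")
    rng = random.Random(seed)
    n_synth = min(len(synthetic), int(round(synth_ratio * len(real))))
    chosen = rng.sample(list(synthetic), n_synth)
    mixed = list(real) + chosen
    rng.shuffle(mixed)  # interleave real and synthetic examples
    return mixed


# Toy manifests: 4 real recordings and 10 synthesized ones.
real = [Utterance(f"real/{i}.wav", f"command {i}", False) for i in range(4)]
synth = [Utterance(f"synth/{i}.wav", f"command {i}", True) for i in range(10)]

train = build_augmented_set(real, synth, synth_ratio=1.5)
print(len(train), sum(u.synthetic for u in train))
```

Keeping a `synthetic` flag on each record makes it straightforward to vary the share of synthesized data between runs and to report results on real speech only.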

Code & Models

Repositories

WingZLeung/TTDS (jax · Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research