Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
Francesco Nespoli, Daniel Barreda, Patrick A. Naylor

TL;DR
This paper explores using zero-shot text-to-speech augmentation to improve automatic speech recognition accuracy on accented speech, reducing errors especially when real accented data is scarce.
Contribution
It introduces a zero-shot TTS-based augmentation strategy that enhances ASR performance on accented speech with limited real data, a novel approach in low-resource scenarios.
Findings
Up to 5% word error rate reduction (WERR) with augmentation
Synthetic data combined with real data outperforms real data alone
Method effective for under-represented accents in training data
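The findings are stated in terms of WERR, so a small worked example may help show how such a figure is usually computed. The sketch below is illustrative only: the summary does not say whether the reported WERR is relative or absolute, so the relative form is an assumption here, and the function names `word_error_rate` and `relative_werr` are hypothetical helpers, not code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_werr(baseline_wer: float, augmented_wer: float) -> float:
    """Relative WER reduction (assumed definition, not confirmed by the paper)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - augmented_wer) / baseline_wer
```

With made-up numbers (not results from the paper), a baseline WER of 0.20 that drops to 0.19 after augmentation corresponds to a relative WERR of about 5%, whereas the absolute reduction would be 1 percentage point. The distinction matters when comparing figures across papers.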
Abstract
In recent years, automatic speech recognition (ASR) models have greatly improved transcription performance both in clean, low-noise acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. For example, this happens when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. Specifically, in this paper we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment the accented speech corpora. We show that this augmentation method is able to mitigate the loss in performance of the ASR system on accented data up…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
