Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora

Francesco Nespoli; Daniel Barreda; Patrick A. Naylor

arXiv:2409.11107·eess.AS·September 18, 2024

Open Access

TL;DR

This paper explores using zero-shot text-to-speech augmentation to improve automatic speech recognition accuracy on accented speech, reducing errors especially when real accented data is scarce.

Contribution

It introduces a zero-shot TTS-based augmentation strategy that enhances ASR performance on accented speech with limited real data, a novel approach in low-resource scenarios.

Findings

01 Augmentation yields up to 5% word error rate reduction (WERR)

02 Synthetic data combined with real data outperforms real data alone

03 The method is effective for accents that are under-represented in the training data
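
The WERR figure above can be made concrete with a small sketch. It assumes WERR means the relative reduction in word error rate between a baseline system and the augmented one, which is the common usage; the page does not spell out the formula. The function names and the 0.20 and 0.19 example values are hypothetical and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, augmented_wer: float) -> float:
    """Assumed WERR definition: (baseline - augmented) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - augmented_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    baseline = 0.20   # WER without synthetic augmentation
    augmented = 0.19  # WER with synthetic augmentation
    print(f"WERR: {relative_wer_reduction(baseline, augmented):.1%}")
```

Under this definition, a 5% WERR is a relative figure: a baseline WER of 20% would drop to 19%, not to 15%.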

Abstract

In recent years, automatic speech recognition (ASR) models have greatly improved transcription performance both in clean, low-noise acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. For example, this happens when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. Specifically, in this paper we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment the accented speech corpora. We show that this augmentation method is able to mitigate the loss in performance of the ASR system on accented data up…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing