Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis

Cong-Thanh Do; Shuhei Imai; Rama Doddipatla; Thomas Hain

arXiv:2407.04047·cs.CL·July 8, 2024

Open Access

TL;DR

This paper proposes a data augmentation method based on unsupervised text-to-speech synthesis to improve accented speech recognition, reporting up to 6.1% relative word error rate reduction in experiments with Wav2vec2.0 models.

Contribution

It introduces a novel unsupervised TTS-based data augmentation approach for accented speech recognition, reducing reliance on manual transcriptions.

Findings

1. Up to 6.1% relative WER reduction with synthetic accented speech data.

2. Effective use of small accented speech datasets for TTS training.

3. Improved ASR performance in accented speech scenarios.
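
The "relative WER reduction" quoted in the findings is the drop in word error rate expressed as a fraction of the baseline WER, not a difference in absolute percentage points. The following is a minimal sketch of both quantities, assuming the standard word-level edit-distance definition of WER. The function names and the numbers in the usage note are illustrative and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer
```

As a made-up example, a baseline at 20.0% WER and an augmented system at 18.0% WER give `relative_wer_reduction(20.0, 18.0)` of 0.1, that is a 10% relative reduction from a 2-point absolute gain.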

Abstract

This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. The TTS systems are trained on a small amount of accented speech paired with pseudo-labels rather than manual transcriptions, and are in that sense unsupervised. This approach makes it possible to use accented speech data that has no manual transcriptions for data augmentation in accented speech recognition. Synthetic accented speech, generated from text prompts by the TTS systems, is then combined with the available non-accented speech data to train automatic speech recognition (ASR) systems. ASR experiments are performed in a self-supervised learning framework using a Wav2vec2.0 model which was pre-trained on a large amount of unsupervised accented speech data. The accented speech data for training the unsupervised TTS are read speech,…
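
The data flow described in the abstract can be sketched as two bookkeeping steps: pseudo-label the untranscribed accented audio to obtain TTS training pairs, then mix TTS output with the real non-accented data for ASR fine-tuning. The sketch below is an assumption-laden outline, not the authors' implementation. `Utterance`, `pseudo_label_corpus`, `build_augmented_set`, and the `transcribe` and `synthesize` callables are hypothetical placeholders for an existing ASR model and a trained TTS system, and TTS training itself is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Utterance:
    audio: str   # path to a waveform file
    text: str    # transcription, pseudo-label, or text prompt
    source: str  # "real", "pseudo-labelled", or "synthetic"


def pseudo_label_corpus(
    accented_audio: Iterable[str],
    transcribe: Callable[[str], str],
) -> List[Utterance]:
    """Pair untranscribed accented audio with ASR pseudo-labels for TTS training."""
    # `transcribe` is a hypothetical stand-in for any existing ASR system.
    return [Utterance(path, transcribe(path), "pseudo-labelled") for path in accented_audio]


def build_augmented_set(
    non_accented: Iterable[Utterance],
    prompts: Iterable[str],
    synthesize: Callable[[str], str],
) -> List[Utterance]:
    """Combine real non-accented data with synthetic accented speech."""
    # `synthesize` is a hypothetical stand-in for the trained TTS system;
    # it takes a text prompt and returns the path of the generated audio.
    synthetic = [Utterance(synthesize(p), p, "synthetic") for p in prompts]
    return list(non_accented) + synthetic
```

Keeping a `source` field on every utterance makes it straightforward to control or report the ratio of real to synthetic data when fine-tuning, which is a design choice of this sketch rather than something stated in the source.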

Taxonomy

Topics: Speech Recognition and Synthesis