Towards Selection of Text-to-speech Data to Augment ASR Training

Shuo Liu; Leda Sarı; Chunyang Wu; Gil Keren; Yuan Shangguan; Jay Mahadeokar; Ozlem Kalinli

arXiv:2306.00998 · eess.AS · June 5, 2023 · 1 cites

TL;DR

This paper introduces a neural network-based method to select synthetic TTS samples that enhance ASR training, reducing data size while maintaining accuracy, thus improving efficiency in speech recognition systems.

Contribution

We propose a novel data selection approach using a neural similarity measure to optimize TTS data inclusion for ASR training, outperforming baseline methods.

Findings

01

Including synthetic samples that differ considerably from real speech, partly because of lexical differences, improves ASR performance.

02

Our method reduces the required TTS data to below 30% of its original size.

03

The selected subset maintains the speech recognition accuracy obtained with all TTS data.
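
The findings above can be illustrated with a small selection sketch. It assumes each TTS utterance already has a similarity-to-real-speech score in [0, 1] from some scoring model, and it keeps a fixed budget of utterances spread evenly over the whole score range so that dissimilar samples are still represented. The function name `select_tts_subset`, the even-spacing policy, and the default `keep_fraction` are hypothetical choices for illustration; this is not the selection rule used in the paper.

```python
def select_tts_subset(scores, keep_fraction=0.3):
    """Return a budgeted subset of utterance ids covering the full score range.

    scores: dict mapping utterance id -> similarity to real speech in [0, 1]
            (assumed to come from a separately trained scoring model).
    keep_fraction: share of the TTS data to keep (hypothetical default).
    """
    # Sort from least to most similar to real speech; ties broken by id.
    ranked = sorted(scores, key=lambda utt: (scores[utt], utt))
    n = len(ranked)
    budget = max(0, min(n, round(keep_fraction * n)))
    if budget == 0:
        return []
    # Evenly spaced picks over the ranked list, so both dissimilar and
    # similar utterances end up in the subset.
    return [ranked[int(i * n / budget)] for i in range(budget)]


if __name__ == "__main__":
    demo_scores = {f"u{i}": i / 10 for i in range(10)}
    print(select_tts_subset(demo_scores, keep_fraction=0.3))
```

Other policies, such as sampling per score bin or weighting toward one end of the range, fit the same interface; which one works best is an empirical question this sketch does not answer.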

Abstract

This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, which can be optimised using cross-entropy loss or ArcFace loss, to measure the similarity of synthetic data to real speech. We found that incorporating synthetic samples with considerable dissimilarity to real speech, owing in part to lexical differences, into ASR training is crucial for boosting recognition performance. Experimental results on LibriSpeech test sets indicate that, to maintain the same speech recognition accuracy as when using all TTS data, our proposed solution can reduce the TTS data to below $30\%$ of its original size, which is superior to several baseline methods.
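
For readers unfamiliar with the ArcFace (additive angular margin) loss mentioned in the abstract, here is a minimal pure-Python sketch of its standard form: the embedding and class weights are L2-normalised, the margin is added to the angle of the target class, and the scaled cosines go through a regular softmax cross-entropy. The two-class setup (real vs. synthetic), the `scale` and `margin` defaults, and all function names are assumptions for illustration, not values or code from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_logits(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Scaled cosine logits with an angular margin added on the target class."""
    e = l2_normalize(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        w = l2_normalize(w)
        cos_theta = sum(a * b for a, b in zip(e, w))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if k == target:
            # Penalise the target class: cos(theta + m) <= cos(theta) while
            # theta + m <= pi. Production code usually handles theta + m > pi
            # separately; this sketch does not.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def arcface_loss(embedding, class_weights, target, scale=30.0, margin=0.5):
    logits = arcface_logits(embedding, class_weights, target, scale, margin)
    return cross_entropy(logits, target)


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical classes: 0 = real, 1 = synthetic
    x = [1.0, 0.2]
    print(arcface_loss(x, weights, target=0, scale=4.0, margin=0.0))
    print(arcface_loss(x, weights, target=0, scale=4.0, margin=0.5))
```

Setting `margin=0.0` reduces this to a normalised-softmax cross-entropy, which makes the effect of the margin easy to compare: the same example incurs a larger loss once the margin is applied.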

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Test · Additive Angular Margin Loss