Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

Jason Li; Ravi Gadde; Boris Ginsburg; Vitaly Lavrukhin

arXiv:1811.00707 · cs.CL · November 5, 2018 · 41 citations

Open Access

TL;DR

This paper augments a natural speech dataset (LibriSpeech) with synthetic speech to train large end-to-end neural speech recognition models, which reach state-of-the-art Word Error Rate (WER) among character-level models that do not use an external language model.

Contribution

The paper proposes augmenting a natural speech dataset with synthetic speech and shows that very large end-to-end neural ASR models trained on the augmented data achieve improved accuracy.

Findings

01 State-of-the-art Word Error Rate (WER) on LibriSpeech among character-level models

02 Synthetic speech augmentation improves model accuracy

03 These results are achieved without an external language model

Abstract

Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such free, open datasets is one of the main issues preventing advancements in ASR research. To address this problem, we propose to augment a natural speech dataset with synthetic speech. We train very large end-to-end neural speech recognition models using the LibriSpeech dataset augmented with synthetic speech. These new models achieve state-of-the-art Word Error Rate (WER) for character-level models without an external language model.
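
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    which is an assumption and may differ from the paper's scoring setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.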

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling