Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures

Nick Rossenbach; Mohammad Zeineldeen; Benedikt Hilmes; Ralf Schlüter; Hermann Ney

arXiv:2104.05379·cs.CL·July 14, 2021

TL;DR

This paper evaluates the impact of synthetic training data on various ASR architectures. It shows significant benefits for attention encoder-decoder (AED) models and minimal benefit for hybrid and CTC-based systems, and it reports a hybrid system that surpasses the previous state of the art on LibriSpeech-100h.

Contribution

It systematically compares the effect of synthetic data across multiple ASR architectures and, for the first time, considers internal language model subtraction in this context.

Findings

01

Synthetic data improves AED-ASR training; with internal language model subtraction, the relative improvement reaches up to 38%.

02

Hybrid and CTC-based systems show minimal benefit from synthetic data.

03

Hybrid system achieves 3.3%/10.0% WER on LibriSpeech-100h, surpassing previous state-of-the-art.

Abstract

Recent publications on automatic speech recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures, which tend to suffer from over-fitting in low-resource scenarios. One solution to tackle this issue is to generate synthetic data with a trained text-to-speech (TTS) system if additional text is available. This was successfully applied in many publications with AED systems, but only to a very limited extent in the context of other ASR architectures. We investigate the effect of varying pre-processing, the speaker embedding and input encoding of the TTS system w.r.t. the effectiveness of the synthesized data for AED-ASR training. Additionally, we also consider internal language model subtraction for the first time, resulting in up to 38% relative improvement. We compare the AED results to a state-of-the-art hybrid ASR system, a monophone-based system using…
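
The abstract mentions internal language model (ILM) subtraction without stating a formula. As a rough illustration, the sketch below uses a common log-linear formulation: the AED score plus a scaled external LM score, minus a scaled estimate of the AED model's internal LM score. This is an assumption about the general technique, not the paper's exact method. The `Hypothesis` class, the field names, the scale values, and the example log-probabilities are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One candidate transcription with its per-model log-probabilities.

    All fields are hypothetical illustration values, not results from the paper.
    """
    text: str
    log_p_aed: float     # log-probability assigned by the AED model
    log_p_ext_lm: float  # log-probability assigned by an external language model
    log_p_ilm: float     # estimated log-probability of the AED's internal LM


def combined_score(hyp: Hypothesis, lm_scale: float, ilm_scale: float) -> float:
    """Log-linear combination: add the external LM, subtract the internal LM."""
    return hyp.log_p_aed + lm_scale * hyp.log_p_ext_lm - ilm_scale * hyp.log_p_ilm


def pick_best(hyps: List[Hypothesis], lm_scale: float, ilm_scale: float) -> Hypothesis:
    """Return the hypothesis with the highest combined score."""
    return max(hyps, key=lambda h: combined_score(h, lm_scale, ilm_scale))


# Two made-up hypotheses: "A" is favored by the AED model and its internal LM,
# "B" is favored by the external LM.
HYPS = [
    Hypothesis("A", log_p_aed=-4.0, log_p_ext_lm=-6.0, log_p_ilm=-1.0),
    Hypothesis("B", log_p_aed=-4.5, log_p_ext_lm=-3.0, log_p_ilm=-4.0),
]

if __name__ == "__main__":
    print(pick_best(HYPS, lm_scale=0.0, ilm_scale=0.0).text)  # AED score only
    print(pick_best(HYPS, lm_scale=0.5, ilm_scale=0.3).text)  # with LM and ILM terms
```

In this formulation both scales are free parameters; they would typically be tuned on held-out data, and setting `ilm_scale` to zero recovers plain external LM fusion.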
