SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

William Chan; Daniel Park; Chris Lee; Yu Zhang; Quoc Le; Mohammad Norouzi

arXiv:2104.02133·cs.CL·April 28, 2021·75 cites

TL;DR

SpeechStew trains one large neural speech recognition model on a simple mix of public datasets, reaching state-of-the-art or near state-of-the-art results without an external language model and transferring well to a low-resource task.

Contribution

The paper introduces SpeechStew, a straightforward method that mixes diverse speech datasets, without re-weighting or re-balancing them, to train a single large neural network with competitive performance.

Findings

01

Achieves state-of-the-art or near state-of-the-art WER on multiple benchmarks, including AMI-IHM, Switchboard, CallHome, and WSJ.

02

Outperforms prior work that used strong external language models, while using no external language model itself.

03

Demonstrates effective transfer learning by fine-tuning on CHiME-6, a noisy low-resource dataset.
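The findings above are reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is illustrative, and the sketch applies no text normalization; it is not the scoring tool used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences: one substitution ("sat" -> "sit") and one deletion
# ("the") against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because the denominator is the reference length, insertions can push WER above 100% on very noisy output.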

Abstract

We present SpeechStew, a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing of the datasets. SpeechStew achieves SoTA or near SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% on WSJ, which significantly outperforms prior work with strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy low resource speech dataset, CHiME-6. We achieve 38.9% WER without a language model, which compares…
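One way to read "simply mixes all of these datasets together, without any special re-weighting or re-balancing" is as plain concatenation followed by a uniform shuffle, so that each dataset is sampled in proportion to its size. The sketch below illustrates that reading only. The helper names `mix_datasets` and `sampling_share`, the utterance IDs, and the toy corpus sizes are all hypothetical and do not come from the paper.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (dataset name, utterance id); both illustrative


def mix_datasets(datasets: Dict[str, List[str]], seed: int = 0) -> List[Example]:
    """Pool every utterance from every dataset, then shuffle uniformly.

    No per-dataset weight is applied, so larger datasets naturally
    contribute more examples to each epoch.
    """
    pool = [(name, utt) for name, utts in datasets.items() for utt in utts]
    random.Random(seed).shuffle(pool)
    return pool


def sampling_share(datasets: Dict[str, List[str]]) -> Dict[str, float]:
    """Fraction of the mixed pool that each dataset accounts for."""
    total = sum(len(utts) for utts in datasets.values())
    return {name: len(utts) / total for name, utts in datasets.items()}


# Toy stand-ins for three of the corpora named in the abstract.
# The sizes are made up for illustration only.
toy_corpora = {
    "AMI": ["ami_0", "ami_1"],
    "LibriSpeech": [f"ls_{i}" for i in range(6)],
    "WSJ": ["wsj_0", "wsj_1"],
}
mixed = mix_datasets(toy_corpora)
shares = sampling_share(toy_corpora)
```

In this toy setup the largest corpus dominates the mix, which is the behavior an unweighted mix implies; a re-balanced scheme would instead cap or upsample individual datasets.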

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing