Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Dario Amodei; Rishita Anubhai; Eric Battenberg; Carl Case; Jared Casper; Bryan Catanzaro; Jingdong Chen; Mike Chrzanowski; Adam Coates; Greg Diamos; Erich Elsen; Jesse Engel; Linxi Fan; Christopher Fougner; Tony Han; Awni Hannun; Billy Jun; Patrick LeGresley; Libby Lin; Sharan Narang; Andrew Ng; Sherjil Ozair; Ryan Prenger; Jonathan Raiman; Sanjeev Satheesh; David Seetapun; Shubho Sengupta; Yi Wang; Zhiqian Wang; Chong Wang; Bo Xiao; Dani Yogatama; Jun Zhan; Zhenyao Zhu

arXiv:1512.02595 · cs.CL · December 9, 2015 · 2.2k cites

TL;DR

This paper demonstrates an end-to-end deep learning system that recognizes both English and Mandarin speech. HPC techniques yield a 7x speedup over the authors' previous system, and a technique called Batch Dispatch runs the system on GPUs in the data center.

Contribution

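It introduces an end-to-end speech recognition approach that replaces pipelines of hand-engineered components with neural networks, handles English and Mandarin as well as noisy environments and accents, and pairs a 7x speedup with a GPU-based deployment technique (Batch Dispatch).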

Findings

01 Achieved a 7x speedup over the authors' previous system through HPC techniques

02 In several cases, competitive with human transcribers on standard datasets

03 Enables low-latency, large-scale online speech recognition

Abstract

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can…

Code & Models

Datasets

gilkeyio/librispeech-alignments
dataset · 1.6k dl

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling