Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin

TL;DR
This paper demonstrates an end-to-end deep learning system for speech recognition in English and Mandarin, achieving high accuracy, efficiency, and scalability through HPC techniques and GPU deployment.
Contribution
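It introduces an end-to-end speech recognition model that handles different languages (English and Mandarin), accents, and noisy environments. It reports a 7x speedup over the authors' previous system through HPC techniques, and it describes Batch Dispatch, a technique for deploying the system on GPUs in the data center.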
Findings
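Achieved a 7x speedup over the authors' previous system, so experiments that previously took weeks now run in days
In several cases, the system is competitive with human transcribers when benchmarked on standard datasets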
Enables low-latency, large-scale online speech recognition
Abstract
We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
