Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
Hagen Soltau, Hank Liao, Hasim Sak

TL;DR
This paper introduces a large vocabulary speech recognition system using deep bi-directional LSTM RNNs with CTC loss that models entire words directly, eliminating the need for phoneme-based units or language models.
Contribution
It presents a novel end-to-end neural speech recognition model that directly recognizes words from acoustic signals, simplifying the traditional pipeline.
Findings
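CTC word models outperform a strong, more complex, state-of-the-art baseline built on sub-word units.
The model directly predicts an output vocabulary of about 100,000 words.
Training on 125,000 hours of semi-supervised acoustic data alleviates the data sparsity problem for word models.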
Abstract
We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to alleviate the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model removing the need to decode. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.
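To illustrate why a word-level CTC model needs no lexicon or language model at inference time, the sketch below shows greedy best-path collapsing: take the top label per frame, merge consecutive repeats, and drop blanks. It is a minimal illustration and not the authors' implementation. The function name `ctc_best_path`, the blank index 0, the three-entry toy vocabulary, and the frame scores are all assumptions made for the example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def ctc_best_path(frame_scores, vocab):
    """Greedy best-path readout: per-frame argmax, merge repeats, drop blanks."""
    # Pick the highest-scoring label for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    # Merge runs of identical consecutive labels into one label each.
    collapsed = [label for label, _ in groupby(best)]
    # Remove blanks and map the remaining label indices to words.
    return [vocab[label] for label in collapsed if label != BLANK]


# Toy vocabulary; the model in the paper uses about 100,000 word outputs.
vocab = ["<blank>", "hello", "world"]

# Made-up per-frame scores over (blank, hello, world).
frame_scores = [
    [0.10, 0.80, 0.10],  # hello
    [0.20, 0.70, 0.10],  # hello (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # world
    [0.60, 0.10, 0.30],  # blank
]

print(ctc_best_path(frame_scores, vocab))  # ['hello', 'world']
```

Because each output label is already a whole word, the collapsed label sequence is the transcript itself. A blank between two identical labels keeps them as separate words, which is how a genuinely repeated word survives the merge step.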
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
Methods: Connectionist Temporal Classification Loss · Sigmoid Activation · Tanh Activation · Long Short-Term Memory
