Listen, Attend and Spell
William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals

TL;DR
This paper introduces Listen, Attend and Spell (LAS), a neural network model for speech recognition that learns to transcribe speech directly to characters. It improves on previous end-to-end CTC models by making no independence assumptions between output characters, and it does not rely on traditional HMM components.
Contribution
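LAS is an end-to-end neural network architecture that learns all components of a speech recognizer jointly. A pyramidal recurrent encoder (the listener) reads filter bank spectra, and an attention-based recurrent decoder (the speller) emits characters, so no separate dictionary or language model is required.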
Findings
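On a subset of the Google voice search task, LAS achieves a 14.1% word error rate (WER) without a dictionary or a language model.
With language model rescoring over the top 32 beams, WER drops to 10.3%.
Compared to traditional DNN-HMM systems, LAS simplifies the speech recognition pipeline by learning all components jointly.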
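For readers unfamiliar with the metric, the WER figures above are conventionally computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Tokenization by whitespace is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("call me at home", "call me home"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.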
Abstract
We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM…
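The two ideas named in the abstract, a pyramidal encoder and an attention-based decoder, can be illustrated with a small sketch. It is written under stated assumptions and is not the authors' implementation: it assumes each pyramid layer halves the time resolution by concatenating adjacent frames, it assumes plain dot-product attention scoring, and it omits the recurrent layers entirely. The helper names `pyramid_reduce` and `attend` are hypothetical.

```python
import math


def pyramid_reduce(frames):
    """Concatenate each pair of consecutive frames, halving the sequence length.

    Dropping a trailing odd frame is an assumption of this sketch.
    """
    if len(frames) % 2 == 1:
        frames = frames[:-1]
    return [frames[t] + frames[t + 1] for t in range(0, len(frames), 2)]


def attend(query, keys, values):
    """Return softmax attention weights and the weighted sum of `values`.

    Scores are plain dot products between the decoder query and each
    encoder key (an assumption; learned scoring functions are also common).
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    # Eight one-dimensional "filter bank" frames (toy data).
    frames = [[float(t)] for t in range(8)]
    reduced = pyramid_reduce(pyramid_reduce(frames))
    print(len(reduced), len(reduced[0]))  # 2 time steps, 4 features each

    weights, context = attend([0.0] * 4, reduced, reduced)
    print(weights, context)
```

Shortening the encoder output this way gives the attention step fewer positions to score for every emitted character, which is the usual motivation for a pyramidal encoder on long acoustic sequences.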
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Humor Studies and Applications
