End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results
Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

TL;DR
This paper introduces an end-to-end attention-based recurrent neural network model for continuous speech recognition that replaces the traditional HMM, and reports phoneme error rates on the TIMIT dataset comparable to those of state-of-the-art HMM-based decoders.
Contribution
It presents initial results for an attention-based RNN encoder-decoder applied to continuous speech recognition, eliminating the need for HMMs.
Findings
Achieves phoneme error rates on TIMIT comparable to state-of-the-art HMM-based decoders.
Demonstrates the feasibility of end-to-end neural models for speech recognition.
Uses an attention mechanism to align input and output sequences during decoding.
Abstract
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
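The attention step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the function names `softmax` and `attend`, and the toy vectors are all assumptions chosen for brevity. The sketch only shows the general idea that a decoder state scores every encoded input frame, the scores are normalized into weights, and the context is the weighted sum of the frames.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoded_frames):
    """Return (weights, context) for one decoding step.

    Assumption: frames are scored by a plain dot product with the
    decoder state; this is a simplification for illustration.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, frame))
        for frame in encoded_frames
    ]
    weights = softmax(scores)
    dim = len(encoded_frames[0])
    # Context vector: attention-weighted sum of the encoded frames.
    context = [
        sum(w * frame[i] for w, frame in zip(weights, encoded_frames))
        for i in range(dim)
    ]
    return weights, context

# Hypothetical toy data: three encoded frames of dimension 2.
encoded_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoded_frames)
print(weights, context)
```

In this toy run the first and third frames receive equal, larger weights because they match the decoder state best, so the context is dominated by those frames. In a full model the decoder would use this context, together with its recurrent state, to emit the next phoneme.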
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
