End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results

Jan Chorowski; Dzmitry Bahdanau; Kyunghyun Cho; Yoshua Bengio

arXiv:1412.1602·cs.NE·December 5, 2014·416 cites

TL;DR

This paper introduces an end-to-end attention-based recurrent neural network model for continuous speech recognition that replaces the traditional HMM, and reports phoneme error rates on the TIMIT dataset that are comparable to state-of-the-art HMM-based decoders.

Contribution

It presents the first application of an attention-based RNN encoder-decoder for continuous speech recognition, eliminating the need for HMMs.

Findings

01

Achieves phoneme error rates comparable to state-of-the-art HMM-based systems.

02

Demonstrates the feasibility of end-to-end neural models for speech recognition.

03

Uses an attention mechanism to align the input and output sequences during decoding.

Abstract

We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
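
The attention step described in the abstract can be illustrated with a minimal sketch of a single decoding step. This is not the authors' implementation: it assumes a plain dot-product score between the decoder state and each encoder state, which the abstract does not specify, and the names `attend`, `decoder_state`, and `encoder_states` are hypothetical. It shows only how a set of weights over the input positions yields the context from which the decoder would emit the next phoneme.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """One attention step (illustrative assumption: dot-product scoring).

    decoder_state:  list of floats, the current decoder hidden state.
    encoder_states: list of lists, one encoder vector per input frame.
    Returns (weights, context), where context is the weighted sum of
    the encoder vectors.
    """
    # Score each input frame against the current decoder state.
    scores = [
        sum(d * h for d, h in zip(decoder_state, frame))
        for frame in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted average of the encoder states.
    context = [
        sum(w * frame[k] for w, frame in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Toy example: three input frames with 2-dimensional encoder states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

In this toy input the first and third frames score equally and receive most of the weight, so the context is dominated by those two frames; a trained model would learn its scoring function rather than use a fixed dot product.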

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing