End-to-End Attention-based Large Vocabulary Speech Recognition

Dzmitry Bahdanau; Jan Chorowski; Dmitriy Serdyuk; Philemon Brakel; Yoshua Bengio

arXiv:1508.04395 · cs.CL · March 16, 2016

TL;DR

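This paper introduces an end-to-end attention-based RNN for large vocabulary speech recognition that predicts characters directly, replacing the HMM used in hybrid systems. It speeds up the attention scan by restricting it to the most promising frames and by pooling neighboring frames over time, and it reports accuracy similar to other HMM-free RNN-based approaches.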

Contribution

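The paper presents an attention-based RNN architecture for speech recognition in which an attention mechanism learns the alignment between input frames and output characters, removing the need for an HMM. It also proposes two ways to make that alignment cheaper: scanning only a subset of the most promising frames, and pooling neighboring frames over time.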

Findings

01. Achieves recognition accuracy similar to other HMM-free RNN-based approaches

02. Speeds up the attention scan by limiting it to the most promising frames and by pooling neighboring frames over time

03. Integrates an n-gram language model into the decoding process

Abstract

Many of the current state-of-the-art Large Vocabulary Continuous Speech Recognition Systems (LVCSR) are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with the acoustic modelling, language modelling and sequence decoding. We investigate a more direct approach in which the HMM is replaced with a Recurrent Neural Network (RNN) that performs sequence prediction directly at the character level. Alignment between the input features and the desired character sequence is learned automatically by an attention mechanism built into the RNN. For each predicted character, the attention mechanism scans the input sequence and chooses relevant frames. We propose two methods to speed up this operation: limiting the scan to a subset of most promising frames and pooling over time the information contained in neighboring frames,…
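The two speed-ups named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean pooling, and the dot-product scoring are all assumptions chosen to keep the example short and dependency-free. It shows only the general idea of shortening the frame sequence before attention and then scoring frames inside a window instead of over the whole utterance.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_over_time(frames, factor=2):
    """Average each group of `factor` neighboring frames (assumed mean pooling).

    The sequence length shrinks by roughly `factor`, so every later
    attention scan has fewer frames to score.
    """
    pooled = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def windowed_attention(query, frames, center, width):
    """Attend only to frames within `width` positions of `center`.

    Scoring is a plain dot product (an assumption for illustration).
    Frames outside the window receive exactly zero weight.
    """
    lo = max(0, center - width)
    hi = min(len(frames), center + width + 1)
    scores = [sum(q * x for q, x in zip(query, frames[t])) for t in range(lo, hi)]
    weights = [0.0] * len(frames)
    weights[lo:hi] = softmax(scores)
    context = [
        sum(weights[t] * frames[t][d] for t in range(lo, hi))
        for d in range(len(query))
    ]
    return weights, context


# Hypothetical toy input: six 2-dimensional frames.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
pooled = pool_over_time(frames, factor=2)  # 6 frames become 3
weights, context = windowed_attention([1.0, 0.0], frames, center=2, width=1)
```

In this sketch the window center is passed in by the caller; how the window is positioned at each decoding step is a design choice that the abstract does not spell out, so it is left as a parameter here.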

Code & Models

Repositories

rizar/attention-lvcsr
PyTorch · Official
