TL;DR
This paper introduces an end-to-end attention-based RNN for large vocabulary continuous speech recognition that predicts characters directly, replacing the HMM used in hybrid systems. It reports accuracy comparable to HMM-based systems and proposes two ways to speed up the attention scan over input frames.
Contribution
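The paper presents an attention-based RNN architecture for speech recognition that removes the need for an HMM, and introduces two methods to speed up attention-based alignment: restricting the scan to a subset of the most promising frames, and pooling information from neighboring frames over time.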
Findings
Achieves recognition accuracy comparable to HMM-based systems
Proposes efficient attention mechanisms for faster decoding
Demonstrates effective integration of language models
Abstract
Many of the current state-of-the-art Large Vocabulary Continuous Speech Recognition Systems (LVCSR) are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with the acoustic modelling, language modelling and sequence decoding. We investigate a more direct approach in which the HMM is replaced with a Recurrent Neural Network (RNN) that performs sequence prediction directly at the character level. Alignment between the input features and the desired character sequence is learned automatically by an attention mechanism built into the RNN. For each predicted character, the attention mechanism scans the input sequence and chooses relevant frames. We propose two methods to speed up this operation: limiting the scan to a subset of most promising frames and pooling over time the information contained in neighboring frames,…
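The two speed-ups named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dot-product scoring, the fixed window around a given center frame, and the helper names `pool_frames` and `windowed_attention` are all assumptions made for illustration only.

```python
import numpy as np

def pool_frames(frames, factor=2):
    """Average-pool neighbouring frames along time (hypothetical helper).

    frames: array of shape (T, d). Trailing frames that do not fill a
    complete group of `factor` are dropped to keep the sketch simple.
    """
    T, d = frames.shape
    T_trim = (T // factor) * factor
    return frames[:T_trim].reshape(T_trim // factor, factor, d).mean(axis=1)

def windowed_attention(frames, query, center, width):
    """Attend only to frames within `width` positions of `center`.

    Scoring is a plain dot product with `query`, an assumed stand-in for
    whatever learned scoring function the model actually uses.
    Returns the attention weights over all T frames and the context vector.
    """
    T = frames.shape[0]
    lo = max(0, center - width)
    hi = min(T, center + width + 1)
    scores = frames[lo:hi] @ query      # scores for frames inside the window only
    scores -= scores.max()              # stabilise the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum()
    alpha = np.zeros(T)                 # frames outside the window get zero weight
    alpha[lo:hi] = weights
    context = alpha @ frames            # weighted sum of frame vectors
    return alpha, context

# Usage: pool 6 frames down to 3, then attend around frame 0.
frames = np.arange(12, dtype=float).reshape(6, 2)
pooled = pool_frames(frames, factor=2)
alpha, context = windowed_attention(pooled, np.array([0.1, 0.1]), center=0, width=1)
```

Pooling shortens the sequence the attention has to scan, and the window limits each scan to a few candidate frames, so the per-character cost no longer grows with the full utterance length.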
