Towards better decoding and language model integration in sequence to sequence models
Jan Chorowski, Navdeep Jaitly

TL;DR
This paper analyzes attention-based sequence-to-sequence speech recognition, identifies key shortcomings, and proposes practical solutions that improve transcription accuracy with and without language models.
Contribution
It proposes practical solutions to overconfident predictions and incomplete transcriptions in attention-based seq2seq speech recognition, achieving competitive speaker-independent WER on the Wall Street Journal (WSJ) dataset.
Findings
Reached 10.6% WER on WSJ without a separate language model
Reduced WER to 6.7% when combined with a trigram language model
Identified and addressed two shortcomings: overconfident predictions and incomplete transcriptions when language models are used
Abstract
The recently proposed Sequence-to-Sequence (seq2seq) framework advocates replacing complex data processing pipelines, such as an entire automatic speech recognition system, with a single neural network trained in an end-to-end fashion. In this contribution, we analyse an attention-based seq2seq speech recognition system that directly transcribes recordings into characters. We observe two shortcomings: overconfidence in its predictions and a tendency to produce incomplete transcriptions when language models are used. We propose practical solutions to both problems achieving competitive speaker independent word error rates on the Wall Street Journal dataset: without separate language models we reach 10.6% WER, while together with a trigram language model, we reach 6.7% WER.
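The WER figures quoted above are word error rates: the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The following is a minimal sketch of that metric, assuming plain whitespace tokenization and no text normalization; the function name `word_error_rate` and the example sentences are illustrative and do not come from the paper, whose scoring tool is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # A truncated hypothesis, as in an incomplete transcription:
    # three of six reference words are missing.
    print(word_error_rate("the cat sat on the mat", "the cat sat"))
```

Under this definition an incomplete transcription is penalized through deletions: every reference word the decoder fails to emit counts as one error, which is why truncated outputs raise WER even when the emitted words are correct.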
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
