Improved training of end-to-end attention models for speech recognition

Albert Zeyer, Kazuki Irie, Ralf Schlüter, Hermann Ney

arXiv:1805.03294·cs.CL·August 6, 2019

TL;DR

This paper improves end-to-end attention models for speech recognition by introducing a new pretraining scheme, achieving state-of-the-art results on LibriSpeech, and demonstrating benefits of auxiliary losses and language model fusion.

Contribution

The paper presents a pretraining scheme that starts with a high time reduction factor and lowers it during training, and it explores an auxiliary CTC loss and shallow fusion with LSTM language models to improve speech recognition performance.
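The idea of lowering the time reduction factor over the course of training can be illustrated with a small schedule lookup. This is a minimal sketch, not the authors' implementation: the names `SCHEDULE`, `time_reduction_factor`, and `encoder_length` are hypothetical, and the epoch boundaries and factors are illustrative values that are not taken from the paper.

```python
import math

# Hypothetical schedule of (first_epoch, time_reduction_factor) pairs.
# The values are illustrative assumptions, not the paper's settings.
SCHEDULE = [(1, 32), (5, 16), (10, 8)]


def time_reduction_factor(epoch, schedule=SCHEDULE):
    """Return the factor of the latest schedule entry that has started by `epoch`."""
    factor = schedule[0][1]
    for first_epoch, candidate in schedule:
        if epoch >= first_epoch:
            factor = candidate
    return factor


def encoder_length(num_frames, factor):
    """Number of encoder states left after reducing the time axis by `factor`."""
    return math.ceil(num_frames / factor)
```

Under these assumed values, early epochs see a much shorter encoder sequence than later ones, which is one plausible reason the attention mechanism is easier to learn at the start of training.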

Findings

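01 Reports state-of-the-art WERs of 3.54% on LibriSpeech dev-clean and 3.82% on test-clean

02 Pretraining that starts with a high time reduction factor and lowers it during training improves convergence and final performance

03 Shallow fusion with LSTM language models yields up to 27% relative WER improvement over the attention baseline without a language model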
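A relative WER reduction is measured against the baseline WER rather than in absolute percentage points. The helper below is a minimal sketch of that arithmetic; the function name `relative_wer_reduction` is hypothetical, and any numbers passed to it here would be made-up inputs rather than results from the paper.

```python
def relative_wer_reduction(baseline_wer, improved_wer):
    """Fraction by which `improved_wer` is lower than `baseline_wer`."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer
```

For instance, a hypothetical drop from 10.0% to 7.3% WER is a 2.7-point absolute reduction but roughly a 27% relative reduction.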

Abstract

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report the state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help the convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvements in WER over the attention baseline without a language model.
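Shallow fusion is commonly formulated as a log-linear combination of the attention model's and the language model's scores at each decoding step. The sketch below assumes that formulation; the function names and the `lm_weight` parameter are hypothetical, and the abstract does not state the weight or the beam search details the authors used.

```python
def shallow_fusion_scores(att_log_probs, lm_log_probs, lm_weight):
    """Combine per-token log-probabilities from both models.

    Both inputs map each candidate subword to a log-probability.
    The fused score is log p_att + lm_weight * log p_lm (assumed formulation).
    """
    return {
        token: att_log_probs[token] + lm_weight * lm_log_probs[token]
        for token in att_log_probs
    }


def best_token(scores):
    """Pick the candidate with the highest fused score (greedy step)."""
    return max(scores, key=scores.get)
```

With `lm_weight` set to 0 the language model is ignored and the choice reduces to the attention baseline; a real decoder would apply the same fused score inside beam search rather than a single greedy step.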

Taxonomy

Methods: Connectionist Temporal Classification Loss