Improved Multi-Stage Training of Online Attention-based Encoder-Decoder Models

Abhinav Garg; Dhananjaya Gowda; Ankur Kumar; Kwangyoun Kim; Mehul Kumar; Chanwoo Kim

arXiv:1912.12384·eess.AS·January 1, 2020·1 cites

Open Access

TL;DR

This paper introduces a refined multi-stage, multi-task training approach for online attention-based encoder-decoder (AED) models, improving over the baselines by 35% and 10% relative for the smaller and bigger models, with evaluation on Librispeech.

Contribution

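The paper proposes a three-stage training strategy (character encoder, BPE-based encoder, attention decoder), combined with multi-task learning at the character and BPE levels and encoder pre-training that includes transfer learning from a bidirectional encoder.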

Findings

1. 35% and 10% relative WER reduction for smaller and bigger models

2. Achieved WER of 5.04% and 4.48% on Librispeech test-clean

3. Effective use of transfer learning and multi-task training strategies
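
The findings above mix absolute WER figures with relative reductions, so a short worked example of both quantities may help. The following is a minimal sketch, not the paper's scoring code: the function names are hypothetical, the transcripts and the 10.0 / 6.5 WER values are made-up illustrative inputs, and the WER definition used is the standard word-level edit distance divided by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # (before the current word) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer with respect to baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative inputs only (not taken from the paper).
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
example_gain = relative_reduction(10.0, 6.5)
```

Under this definition, a hypothetical drop from 10.0% to 6.5% WER is a 35% relative reduction, even though the absolute difference is only 3.5 points.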

Abstract

In this paper, we propose a refined multi-stage multi-task training strategy to improve the performance of online attention-based encoder-decoder (AED) models. A three-stage training based on three levels of architectural granularity, namely character encoder, byte pair encoding (BPE) based encoder, and attention decoder, is proposed. Also, multi-task learning based on two levels of linguistic granularity, namely character and BPE, is used. We explore different pre-training strategies for the encoders, including transfer learning from a bidirectional encoder. Our encoder-decoder models with online attention show 35% and 10% relative improvement over their baselines for smaller and bigger models, respectively. Our models achieve a word error rate (WER) of 5.04% and 4.48% on the Librispeech test-clean data for the smaller and bigger models, respectively, after fusion with long short-term…
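
The staged, multi-task structure described in the abstract can be pictured with a small sketch. Only the stage order (character encoder, BPE encoder, attention decoder) and the two linguistic levels (character, BPE) come from the abstract. Which losses are active in each stage, the weighted-sum form of the combined loss, and all names and numeric values below are assumptions chosen for illustration, not the authors' configuration.

```python
from typing import Dict, Tuple

# Stage order follows the abstract. The set of active losses per stage is an
# ASSUMPTION for illustration; the abstract does not specify it.
STAGES: Tuple[Tuple[str, Tuple[str, ...]], ...] = (
    ("character encoder", ("char",)),
    ("BPE encoder", ("char", "bpe")),
    ("attention decoder", ("char", "bpe", "attention")),
)


def multitask_loss(
    losses: Dict[str, float],
    weights: Dict[str, float],
    active: Tuple[str, ...],
) -> float:
    """Weighted sum of the per-task losses that are active in a stage."""
    return sum(weights[name] * losses[name] for name in active)


# Hypothetical per-task loss values and weights (not from the paper).
losses = {"char": 2.0, "bpe": 3.0, "attention": 4.0}
weights = {"char": 0.5, "bpe": 0.5, "attention": 1.0}

stage_totals = {
    stage: multitask_loss(losses, weights, active) for stage, active in STAGES
}
```

In a real training loop each stage would update model parameters against its combined loss before the next stage begins; the sketch only shows how the objective could grow from one stage to the next.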

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Byte Pair Encoding