Learning Shared Encoding Representation for End-to-End Speech Recognition Models

Thai-Son Nguyen; Sebastian Stueker; Alex Waibel

arXiv:1904.02147 · eess.AS · April 4, 2019 · 1 cites

TL;DR

This paper trains a multi-task network with CTC and framewise cross-entropy criteria to learn a shared encoding for end-to-end speech recognition. The shared encoding improves CTC models and is used to initialize the encoder of attention-based models.

Contribution

The paper proposes a multi-task training method that learns a shared encoding for speech recognition, and shows that this encoding is an effective initialization for the encoder of attention-based models.

Findings

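01 Multi-task training with CTC and framewise cross-entropy eases the optimization of CTC models such as acoustic-to-word.

02 The multi-task network significantly improves on single-task training, even when the latter uses an optimal setup.

03 An attention-based model with a 10-layer LSTM encoder, initialized from the shared encoding, reaches 12.2% and 22.6% WER on the Switchboard and CallHome subsets of Hub5 2000.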

Abstract

In this work, we learn a shared encoding representation for a multi-task neural network model optimized with connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that the multi-task training not only tackles the complexity of optimizing CTC models such as acoustic-to-word but also results in significant improvement compared to the plain-task training with an optimal setup. Furthermore, we propose to use the encoding representation learned by the multi-task network to initialize the encoder of attention-based models. Thereby, we train a deep attention-based end-to-end model with 10 long short-term memory (LSTM) layers of encoder which produces 12.2% and 22.6% word-error-rate on Switchboard and CallHome subsets of the Hub5 2000 evaluation.
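The multi-task idea in the abstract can be illustrated with a minimal sketch: one encoder output feeds two heads, and the CTC loss and the framewise cross-entropy loss are combined into a single objective. Everything below is an assumption made for illustration and is not taken from the paper: the toy one-layer tanh encoder stands in for the LSTM stack, the linear heads, the function names, and the interpolation weight `ce_weight` are hypothetical, and the abstract does not say how the two criteria are actually weighted.

```python
import math

def matvec(W, x):
    # Multiply a weight matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def logsumexp(values):
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def shared_encoder(frames, W_enc):
    # Toy stand-in for the shared encoder (assumption: the paper uses LSTM layers).
    return [[math.tanh(v) for v in matvec(W_enc, f)] for f in frames]

def ctc_nll(log_probs, labels, blank=0):
    # Negative log-likelihood of `labels` under CTC, via the forward algorithm.
    ext = [blank]
    for label in labels:
        ext += [label, blank]          # interleave blanks: b, l1, b, l2, b, ...
    alpha = [-math.inf] * len(ext)
    alpha[0] = log_probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        new_alpha = []
        for s, sym in enumerate(ext):
            cands = [alpha[s]]                         # stay on the same symbol
            if s >= 1:
                cands.append(alpha[s - 1])             # advance by one position
            if s >= 2 and sym != blank and sym != ext[s - 2]:
                cands.append(alpha[s - 2])             # skip the blank in between
            new_alpha.append(logsumexp(cands) + frame[sym])
        alpha = new_alpha
    return -logsumexp(alpha[-2:])      # end on the last label or the final blank

def framewise_ce(log_probs, frame_targets):
    # Mean cross-entropy against one aligned target per frame.
    return -sum(lp[t] for lp, t in zip(log_probs, frame_targets)) / len(frame_targets)

def multitask_loss(frames, labels, frame_targets, W_enc, W_ctc, W_ce, ce_weight=0.5):
    h = shared_encoder(frames, W_enc)  # computed once, shared by both heads
    ctc_lp = [log_softmax(matvec(W_ctc, v)) for v in h]
    ce_lp = [log_softmax(matvec(W_ce, v)) for v in h]
    return (1 - ce_weight) * ctc_nll(ctc_lp, labels) + ce_weight * framewise_ce(ce_lp, frame_targets)
```

Because both heads read the same encoder output, gradients from either criterion would update the same encoder parameters in a trainable version of this sketch. Under the same assumptions, initializing an attention-based model would amount to copying the trained encoder parameters (here `W_enc`) into its encoder before training continues.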

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing