Learning Shared Encoding Representation for End-to-End Speech Recognition Models
Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

TL;DR
This paper trains a multi-task network with both CTC and framewise cross-entropy criteria to learn a shared encoding for end-to-end speech recognition, then uses that encoding to initialize the encoder of attention-based models.
Contribution
It proposes a multi-task training method for shared encoding in speech recognition and demonstrates its effectiveness in initializing attention-based models.
Findings
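Multi-task training with CTC and framewise cross-entropy eases the optimization of CTC models such as acoustic-to-word.
The multi-task setup gives a significant improvement over single-task training with an optimal setup.
An attention-based model with a 10-layer LSTM encoder, initialized from the shared encoding, reaches 12.2% and 22.6% WER on the Switchboard and CallHome subsets of Hub5 2000.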
Abstract
In this work, we learn a shared encoding representation for a multi-task neural network model optimized with connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that the multi-task training not only tackles the complexity of optimizing CTC models such as acoustic-to-word but also results in significant improvement compared to the plain-task training with an optimal setup. Furthermore, we propose to use the encoding representation learned by the multi-task network to initialize the encoder of attention-based models. Thereby, we train a deep attention-based end-to-end model with a 10-layer long short-term memory (LSTM) encoder, which achieves word error rates of 12.2% and 22.6% on the Switchboard and CallHome subsets of the Hub5 2000 evaluation.
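To make the multi-task objective concrete, here is a minimal sketch of how a CTC loss and a framewise cross-entropy loss might be combined for one utterance. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weighted-sum form, and the `ctc_weight` parameter are hypothetical, and the abstract does not say how the two criteria are balanced. Each function takes per-frame log-probabilities, which in the paper's setting would come from two output layers sitting on top of the same shared encoder.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the standard forward recursion.

    log_probs: T x V nested list of per-frame log-probabilities.
    target: list of label ids without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                cands.append(alpha[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])          # skip the blank between distinct labels
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total

def framewise_cross_entropy(log_probs, alignment):
    """Mean negative log-probability of the aligned class at each frame."""
    return -sum(lp[c] for lp, c in zip(log_probs, alignment)) / len(alignment)

def multitask_loss(ctc_log_probs, target, ce_log_probs, alignment, ctc_weight=0.5):
    """Weighted sum of the two criteria (the weighting scheme is an assumption)."""
    return (ctc_weight * ctc_nll(ctc_log_probs, target)
            + (1.0 - ctc_weight) * framewise_cross_entropy(ce_log_probs, alignment))

# Toy usage: two frames, two classes, uniform outputs from both heads.
# In practice the two heads would have different output vocabularies.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
print(round(multitask_loss(uniform, [1], uniform, [0, 1]), 4))  # 0.4904
```

In the toy case, three of the four two-frame paths collapse to the single target label, so the CTC term is -ln(0.75), while the framewise term is ln(2). In a real system, gradients from both terms would flow back into the shared encoder, which is what lets the framewise criterion help the harder CTC optimization.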
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing
