Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

Jicheng Zhang; Yizhou Peng; Haihua Xu; Yi He; Eng Siong Chng; Hao Huang

arXiv:2207.04177·eess.AS·July 12, 2022·1 cites

Open Access

TL;DR

This paper introduces an intermediate-layer output (ILO) regularization method for attention-based speech recognition in which intermediate encoder outputs are fed directly into the shared decoder during training, improving results without the extra training overhead of conventional multitask methods.

Contribution

The proposed method feeds intermediate encoder outputs to the decoder, alongside the final encoder output, during training. This regularizes the encoder and decoder together instead of relying on conventional multitask training.

Findings

01. Improved recognition accuracy over conventional ILO-based CTC methods

02. Enhanced performance compared to original attention-based models

03. Regularized training leads to more effective encoder-decoder learning

Abstract

Intermediate layer output (ILO) regularization by means of multitask training on the encoder side has been shown to be an effective approach to yielding improved results on a wide range of end-to-end ASR frameworks. In this paper, we propose a novel method to do ILO regularized training differently. Instead of using conventional multitask methods that entail more training overhead, we directly feed the intermediate layer output to the decoder; that is, our decoder not only accepts the output of the final encoder layer as input, it also takes the encoder ILO as input during training. With the proposed method, as both encoder and decoder are simultaneously "regularized", the network is more sufficiently trained, consistently leading to improved results over the ILO-based CTC method, as well as over the original attention-based modeling method without the proposed…
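The training idea described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. The toy layers, the helper names (`run_encoder`, `ilo_shared_decoder_loss`), the tapped layer index, and in particular the weighted sum used to combine the two losses (`ilo_weight`) are all assumptions made for illustration; the abstract does not state how the two decoder passes are combined.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]
LossFn = Callable[[Vector, Vector], float]


def run_encoder(layers: Sequence[Layer], features: Vector,
                ilo_index: int) -> Tuple[Vector, Vector]:
    """Run every encoder layer and also capture one intermediate output.

    ilo_index is the 1-based layer to tap (a hypothetical choice).
    Returns (final_output, intermediate_output).
    """
    if not 1 <= ilo_index <= len(layers):
        raise ValueError("ilo_index must be between 1 and len(layers)")
    hidden = features
    intermediate = features
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        if depth == ilo_index:
            intermediate = hidden
    return hidden, intermediate


def ilo_shared_decoder_loss(layers: Sequence[Layer], decoder_loss: LossFn,
                            features: Vector, targets: Vector,
                            ilo_index: int, ilo_weight: float = 0.5) -> float:
    """Training loss where one shared decoder sees both encoder outputs.

    The same decoder_loss callable (same parameters) is applied to the final
    and the intermediate encoder output. The weighted sum is an assumption.
    """
    final_out, inter_out = run_encoder(layers, features, ilo_index)
    loss_final = decoder_loss(final_out, targets)
    loss_inter = decoder_loss(inter_out, targets)  # shared decoder, second pass
    return (1.0 - ilo_weight) * loss_final + ilo_weight * loss_inter


def mean_squared_error(output: Vector, targets: Vector) -> float:
    """Stand-in for an attention decoder's loss (toy, for illustration only)."""
    return sum((o - t) ** 2 for o, t in zip(output, targets)) / len(targets)


def add_one(hidden: Vector) -> Vector:
    """Toy encoder layer that adds 1.0 to every element."""
    return [x + 1.0 for x in hidden]


toy_layers = [add_one, add_one, add_one, add_one]
toy_loss = ilo_shared_decoder_loss(
    toy_layers, mean_squared_error,
    features=[0.0, 0.0], targets=[4.0, 4.0], ilo_index=2,
)
```

In this toy setup the final encoder output matches the targets exactly, so the whole loss comes from the intermediate-layer pass. At inference time only the final encoder output would be passed to the decoder, since the abstract describes the intermediate input as a training-time device.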

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing