Integrating Knowledge into End-to-End Speech Recognition from External Text-Only Data

Ye Bai; Jiangyan Yi; Jianhua Tao; Zhengqi Wen; Zhengkun Tian; Shuai Zhang

arXiv:1912.01777·eess.AS·March 17, 2021

TL;DR

This paper introduces LST, a novel method to incorporate external text-only data into end-to-end speech recognition models, enabling the use of full sentence context without increasing inference complexity.

Contribution

The paper proposes a two-stage teacher-student learning approach and a causal cloze language model to effectively integrate external text data and utilize sentence context in AED models.

Findings

01. Improved recognition accuracy on Chinese datasets AISHELL-1 and AISHELL-2.

02. Effective use of external text-only data without added inference complexity.

03. Leveraging full sentence context enhances model performance.

Abstract

Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because of the end-to-end training, an AED model is usually trained with speech-text paired data. It is challenging to incorporate external text-only data into AED models. Another issue of the AED model is that it does not use the right context of a text token while predicting the token. To alleviate the above two issues, we propose a unified method called LST (Learn Spelling from Teachers) to integrate knowledge into an AED model from the external text-only data and leverage the whole context in a sentence. The method is divided into two stages. First, in the representation stage, a language model (LM) is trained on the text. This can be seen as compressing the knowledge in the text into the LM. Then, at the transferring stage, the knowledge is transferred to the AED model…
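The transferring stage described in the abstract is a form of teacher-student learning, which can be illustrated with a minimal per-token sketch. This is not the authors' implementation: it assumes the common distillation setup in which the student (the AED decoder) is trained on a weighted sum of cross-entropy against the reference token and KL divergence against the teacher LM's softened distribution. The function name `distillation_token_loss`, the weight `lam`, and the `temperature` parameter are hypothetical names chosen for illustration; the paper's exact loss may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_token_loss(student_logits, teacher_logits, target_index,
                            lam=0.5, temperature=1.0):
    """Illustrative loss for one output token (an assumption, not the paper's code).

    lam weights the hard-label cross-entropy; (1 - lam) weights the KL
    divergence from the teacher LM's distribution to the student's.
    """
    student_probs = softmax(student_logits)
    # Only the teacher distribution is softened by the temperature here.
    teacher_probs = softmax(teacher_logits, temperature)

    # Cross-entropy against the reference transcript token.
    ce = -math.log(student_probs[target_index])

    # KL(teacher || student): penalizes the student for disagreeing with the LM.
    kl = sum(p * (math.log(p) - math.log(q))
             for p, q in zip(teacher_probs, student_probs) if p > 0.0)

    return lam * ce + (1.0 - lam) * kl


if __name__ == "__main__":
    # Toy vocabulary of three tokens; the teacher prefers token 1.
    print(distillation_token_loss([1.0, 0.5, 0.1], [0.2, 2.0, 0.1], target_index=1))
```

Because the teacher is only consulted while computing the training loss, a student trained this way runs at inference time without the LM, which is consistent with the claim that the method adds no inference complexity.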

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence