Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

Zhong Meng; Sarangarajan Parthasarathy; Eric Sun; Yashesh Gaur; Naoyuki Kanda; Liang Lu; Xie Chen; Rui Zhao; Jinyu Li; Yifan Gong

arXiv:2011.01991·eess.AS·November 5, 2020

Open Access

TL;DR

This paper introduces an internal language model estimation method that improves external LM integration in end-to-end speech recognition models, reducing domain mismatch and enhancing accuracy without additional training.

Contribution

The proposed ILME method estimates the internal LM scores of an E2E model and subtracts them from the combined E2E and external LM score, enabling more effective external LM integration for existing E2E architectures such as RNN-T and AED, with no additional model training.

Findings

01

Achieved up to 15.5% relative WER reduction over Shallow Fusion on the out-of-domain LibriSpeech test sets

02

Achieved up to 6.8% relative WER reduction over Shallow Fusion on in-domain Microsoft production test sets

03

Effective domain adaptation for E2E ASR models

Abstract

External language model (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR), which has no clear division between acoustic and language models. In this work, we propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models with no additional model training, including the most popular recurrent neural network transducer (RNN-T) and attention-based encoder-decoder (AED) models. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the training data in the source domain. With ILME, the internal LM scores of an E2E model are estimated and subtracted from the log-linear interpolation between the scores of the E2E model and the external LM. The internal LM scores are approximated as the output of an E2E model when…
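The scoring rule described in the abstract (internal LM scores subtracted from the log-linear interpolation of E2E and external LM scores) can be illustrated with a minimal sketch. It assumes the three log-probabilities for each hypothesis are already available; how the internal LM score is estimated is not covered here, since the abstract above is truncated at that point. The function names, the weight parameters `ext_weight` and `int_weight`, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    log_p_e2e: float      # log-probability from the E2E model
    log_p_ext_lm: float   # log-probability from the external LM
    log_p_int_lm: float   # estimated internal LM log-probability (assumed given)


def shallow_fusion_score(h: Hypothesis, ext_weight: float) -> float:
    # Log-linear interpolation of the E2E and external LM scores.
    return h.log_p_e2e + ext_weight * h.log_p_ext_lm


def ilme_score(h: Hypothesis, ext_weight: float, int_weight: float) -> float:
    # Same interpolation, with the weighted internal LM score subtracted.
    return shallow_fusion_score(h, ext_weight) - int_weight * h.log_p_int_lm


def best(hyps, score_fn):
    # Pick the hypothesis with the highest combined score.
    return max(hyps, key=score_fn)


# Toy values (hypothetical): hypothesis "a" is favoured by the internal LM,
# hypothesis "b" by the external LM.
hyps = [
    Hypothesis("a", log_p_e2e=-4.0, log_p_ext_lm=-5.0, log_p_int_lm=-1.0),
    Hypothesis("b", log_p_e2e=-5.0, log_p_ext_lm=-4.0, log_p_int_lm=-4.0),
]

sf_best = best(hyps, lambda h: shallow_fusion_score(h, ext_weight=0.5))
ilme_best = best(hyps, lambda h: ilme_score(h, ext_weight=0.5, int_weight=0.3))
print(sf_best.text, ilme_best.text)
```

In this toy setting, subtracting the internal LM score penalizes the hypothesis that the source-domain internal LM prefers, so the ranking can differ from plain shallow fusion. In practice both weights would be tuned on held-out data.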

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing