Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition
Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong

TL;DR
This paper introduces a method for adapting end-to-end speech recognition models using only text data by fine-tuning their internal language model components, achieving up to 34.9% relative word error rate (WER) reduction without adding inference cost.
Contribution
The authors propose internal LM adaptation (ILMA) that fine-tunes internal components of E2E models using text-only data, eliminating the need for external language models during inference.
Findings
ILMA achieves up to a 34.9% relative WER reduction.
ILMA is effective when only the last linear layer is updated.
For best results, the E2E model should be trained with an internal LM loss in addition to the standard E2E loss.
Abstract
Text-only adaptation of an end-to-end (E2E) model remains a challenging task for automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, significantly increasing the computation cost. To overcome this, we propose an internal LM adaptation (ILMA) of the E2E model using text-only data. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the token sequence probability which is approximated by the E2E model output after zeroing out the encoder contribution. During ILMA, we fine-tune the internal LM, i.e., the E2E components excluding the encoder, to minimize a cross-entropy loss. To make ILMA effective, it is essential to train the E2E model with an internal LM loss besides the standard E2E loss. Furthermore, we propose to regularize ILMA by minimizing the…
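The two mechanisms the abstract describes, approximating the internal LM by zeroing out the encoder contribution and fine-tuning the non-encoder components with a cross-entropy loss on text, can be illustrated with a toy sketch. This is not the authors' implementation: the single-embedding prediction network, the tanh joint activation, the sizes `V` and `H`, the learning rate, and the token sequence `adaptation_text` are all assumptions chosen to keep the example small. It updates only the last linear layer, and it omits the regularization term because the abstract text above is truncated before describing it.

```python
import math
import random

random.seed(0)
V, H = 4, 8  # toy vocabulary and hidden sizes (illustrative assumptions)

# Frozen stand-in for the prediction network: one embedding per previous token.
embed = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(V)]
# Last linear layer of the joint network: the only part updated in this sketch.
W = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(V)]
b = [0.0] * V


def joint_hidden(prev_token, enc=None):
    """Hidden joint activation; enc=None zeroes out the encoder contribution."""
    enc = enc if enc is not None else [0.0] * H
    return [math.tanh(e + p) for e, p in zip(enc, embed[prev_token])]


def internal_lm_probs(prev_token):
    """Next-token distribution with the encoder contribution zeroed out."""
    h = joint_hidden(prev_token)
    logits = [sum(w * x for w, x in zip(W[k], h)) + b[k] for k in range(V)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps], h


def cross_entropy(text):
    """Mean negative log-likelihood of next tokens under the internal LM."""
    pairs = list(zip(text[:-1], text[1:]))
    return -sum(math.log(internal_lm_probs(p)[0][n]) for p, n in pairs) / len(pairs)


def ilma_step(text, lr=0.2):
    """One SGD pass over the text, updating only the last linear layer (W, b)."""
    for prev, nxt in zip(text[:-1], text[1:]):
        probs, h = internal_lm_probs(prev)
        for k in range(V):
            grad = probs[k] - (1.0 if k == nxt else 0.0)  # dCE / dlogit_k
            b[k] -= lr * grad
            for j in range(H):
                W[k][j] -= lr * grad * h[j]


adaptation_text = [0, 1, 2, 3] * 5  # hypothetical text-only token ids
loss_before = cross_entropy(adaptation_text)
for _ in range(50):
    ilma_step(adaptation_text)
loss_after = cross_entropy(adaptation_text)
print(f"internal LM cross-entropy: {loss_before:.3f} -> {loss_after:.3f}")
```

Because the encoder is never evaluated during adaptation, no audio is needed, and because no extra model is attached, the adapted network has the same inference cost as the original one.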
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Linear Layer
