Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition
Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong

TL;DR
This paper introduces a novel minimum WER training method with internal LM estimation fusion for end-to-end speech recognition, reducing the need for LM weight tuning and improving WER across multiple test sets.
Contribution
It proposes MWER-ILME, a new training approach that integrates internal LM estimation into MWER training, enhancing domain robustness without extensive LM weight tuning.
Findings
MWER-ILME achieves 8.8% relative WER reduction over MWER.
MWER-ILME outperforms MWER-SF with a 5.8% relative WER reduction.
Robust LM integration across diverse test sets.
Abstract
Integrating external language models (LMs) into end-to-end (E2E) models remains a challenging task for domain-adaptive speech recognition. Recently, internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction from Shallow Fusion by subtracting a weighted internal LM score from an interpolation of E2E model and external LM scores during beam search. However, on different test sets, the optimal LM interpolation weights vary over a wide range and have to be tuned extensively on well-matched validation sets. In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weights tuning during inference. Besides MWER training with Shallow Fusion (MWER-SF), we propose a novel MWER training with ILME (MWER-ILME) where the ILME-based fusion is conducted to generate N-best hypotheses and their…
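The scoring rules described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the weight parameters `lm_weight` and `ilm_weight`, and the mean-error baseline in the loss are assumptions made for illustration. The baseline follows a common formulation of MWER training, not anything the truncated abstract confirms. All scores are assumed to be sequence-level log-probabilities.

```python
import math


def shallow_fusion_score(log_p_e2e, log_p_ext_lm, lm_weight):
    """Shallow Fusion: log-linear interpolation of E2E and external LM scores."""
    return log_p_e2e + lm_weight * log_p_ext_lm


def ilme_fusion_score(log_p_e2e, log_p_ext_lm, log_p_int_lm, lm_weight, ilm_weight):
    """ILME-based fusion: subtract a weighted internal LM score from the
    Shallow Fusion interpolation (weights are hypothetical hyperparameters)."""
    return shallow_fusion_score(log_p_e2e, log_p_ext_lm, lm_weight) - ilm_weight * log_p_int_lm


def mwer_loss(nbest_scores, nbest_word_errors):
    """Expected word errors over an N-best list.

    Posteriors are a softmax over the hypothesis scores, renormalized within
    the N-best list. Subtracting the mean error is an assumed variance-reduction
    baseline, not something stated in the abstract.
    """
    max_score = max(nbest_scores)
    exps = [math.exp(s - max_score) for s in nbest_scores]  # numerically stable softmax
    total = sum(exps)
    posteriors = [e / total for e in exps]
    mean_errors = sum(nbest_word_errors) / len(nbest_word_errors)
    return sum(p * (err - mean_errors) for p, err in zip(posteriors, nbest_word_errors))
```

In this sketch, an MWER-SF-style setup would feed `shallow_fusion_score` values into `mwer_loss`, while an MWER-ILME-style setup would feed `ilme_fusion_score` values instead, so that the model is trained under the same fusion rule that is used to rank hypotheses.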
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
