Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

Zhong Meng; Yu Wu; Naoyuki Kanda; Liang Lu; Xie Chen; Guoli Ye; Eric Sun; Jinyu Li; Yifan Gong

arXiv:2106.02302·eess.AS·June 7, 2021

Open Access

TL;DR

This paper introduces a novel minimum WER training method with internal LM estimation fusion for end-to-end speech recognition, reducing the need for LM weight tuning and improving WER across multiple test sets.

Contribution

It proposes MWER-ILME, a new training approach that integrates internal LM estimation into MWER training, enhancing domain robustness without extensive LM weight tuning.

Findings

01. MWER-ILME achieves an 8.8% relative WER reduction over MWER.

02. MWER-ILME outperforms MWER-SF with a 5.8% relative WER reduction.

03. LM integration is robust across diverse test sets.

Abstract

Integrating external language models (LMs) into end-to-end (E2E) models remains a challenging task for domain-adaptive speech recognition. Recently, internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction relative to Shallow Fusion by subtracting a weighted internal LM score from an interpolation of E2E model and external LM scores during beam search. However, the optimal LM interpolation weights vary over a wide range across test sets and have to be tuned extensively on well-matched validation sets. In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weight tuning during inference. Besides MWER training with Shallow Fusion (MWER-SF), we propose a novel MWER training with ILME (MWER-ILME) where the ILME-based fusion is conducted to generate N-best hypotheses and their…
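
The scoring rules described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names, weight names, and the toy N-best list are all hypothetical. The loss shown is one common formulation of MWER (expected word errors over a renormalized N-best list, with the mean error subtracted as a baseline); the truncated abstract does not spell out the exact loss, so using the fused scores as the N-best posteriors is an assumption here.

```python
import math


def shallow_fusion_score(e2e_logp, ext_lm_logp, lm_weight):
    """Shallow Fusion: E2E log-probability plus a weighted external LM log-probability."""
    return e2e_logp + lm_weight * ext_lm_logp


def ilme_fusion_score(e2e_logp, ext_lm_logp, ilm_logp, lm_weight, ilm_weight):
    """ILME-based fusion: Shallow Fusion score minus a weighted internal LM log-probability."""
    return e2e_logp + lm_weight * ext_lm_logp - ilm_weight * ilm_logp


def word_errors(hyp, ref):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # extra word in hyp
                            curr[j - 1] + 1,          # missing word in hyp
                            prev[j - 1] + (h != r)))  # match or substitution
        prev = curr
    return prev[-1]


def mwer_loss(scores, errors):
    """Expected word errors over an N-best list (assumed formulation).

    scores: log-domain hypothesis scores (here, fused scores)
    errors: word error count of each hypothesis against the reference
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    posteriors = [e / total for e in exps]
    mean_err = sum(errors) / len(errors)        # baseline for variance reduction
    return sum(p * (e - mean_err) for p, e in zip(posteriors, errors))


# Toy N-best list: (hypothesis words, E2E log-prob, external LM log-prob, internal LM log-prob)
reference = "turn the lights on".split()
nbest = [
    ("turn the lights on".split(), -2.0, -3.0, -4.0),
    ("turn the light on".split(), -1.8, -5.0, -3.5),
]
fused = [ilme_fusion_score(e, x, i, lm_weight=0.5, ilm_weight=0.25) for _, e, x, i in nbest]
errs = [word_errors(h, reference) for h, _, _, _ in nbest]
loss = mwer_loss(fused, errs)
```

In this sketch the fusion weights are fixed constants. Because the loss is computed on fused scores, gradient-based training would push the E2E model toward hypotheses that have few word errors under that same fusion, which is the intuition behind performing LM fusion inside MWER training rather than tuning weights per test set at inference.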

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques