JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

Zhong Meng; Weiran Wang; Rohit Prabhavalkar; Tara N. Sainath; Tongzhou Chen; Ehsan Variani; Yu Zhang; Bo Li; Andrew Rosenberg; Bhuvana Ramabhadran

arXiv:2302.08583 · eess.AS · February 20, 2023

Open Access

TL;DR

This paper introduces JEIT, a joint training method that integrates unpaired text into speech recognition models to enhance rare-word accuracy without extra adaptation steps.

Contribution

The paper proposes JEIT, a novel joint training approach that incorporates unpaired text into the internal language model during end-to-end speech recognition training, improving rare-word recognition.

Findings

01. JEIT improves rare-word recognition by up to 16.4%.

02. Modular hybrid autoregressive transducer (MHAT) outperforms HAT in JEIT.

03. CJJT further enhances performance by combining JEIT with modality matching.

Abstract

We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method that injects large-scale unpaired text into the ILM during E2E training, which improves rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM estimates a cross-entropy loss on unpaired text. The E2E model is trained to minimize a weighted sum of the E2E and ILM losses. During JEIT, the ILM absorbs knowledge from unpaired text while the E2E training serves as regularization. Unlike ILM adaptation methods, JEIT does not require a separate adaptation step and avoids the need for Kullback-Leibler divergence regularization of the ILM. We also show that the modular hybrid autoregressive transducer (MHAT) performs better than HAT in the JEIT framework, and is much more robust than HAT during ILM adaptation. To push the limit of unpaired text injection, we…
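
The objective described in the abstract, a weighted sum of an E2E loss on paired data and an ILM cross-entropy loss on unpaired text, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the single `ilm_weight` parameter, the per-token averaging, and the toy numbers are all assumptions, since the abstract does not state how the two terms are weighted or normalized.

```python
import math


def ilm_cross_entropy(token_log_probs):
    """Mean negative log-likelihood of one unpaired text sequence.

    token_log_probs: hypothetical per-token log-probabilities that the
    internal LM assigns to the reference tokens. Averaging over tokens
    is an assumption made for this sketch.
    """
    return -sum(token_log_probs) / len(token_log_probs)


def jeit_loss(e2e_loss, ilm_loss, ilm_weight=0.5):
    """Weighted sum of the E2E and ILM losses.

    The form `e2e + weight * ilm` and the default weight are assumptions;
    the abstract only says the two losses are combined as a weighted sum.
    """
    return e2e_loss + ilm_weight * ilm_loss


# Toy values: the ILM gives the two reference tokens probability 0.5 and 0.25,
# and the E2E loss on an audio-transcript batch is taken to be 2.0.
ilm = ilm_cross_entropy([math.log(0.5), math.log(0.25)])
total = jeit_loss(e2e_loss=2.0, ilm_loss=ilm, ilm_weight=0.5)
```

In a real training loop both terms would be computed from the same model in one step, with gradients from the text-only term reaching only the components that make up the ILM, while the paired-data term acts as the regularizer the abstract mentions.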

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing