Text Injection for Neural Contextual Biasing
Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran

TL;DR
This paper introduces contextual text injection (CTI), a method that uses large unpaired text corpora to improve neural contextual biasing in speech recognition, reducing word error rate (WER) by up to 43.3% relative to a strong neural biasing model.
Contribution
The work proposes CTI and CTI-MWER, novel techniques that leverage unpaired text data and MWER training to enhance biasing effectiveness in ASR models.
Findings
Up to 43.3% relative WER reduction with 100 billion text sentences.
CTI-MWER yields a further 23.5% relative WER improvement.
Effective use of unpaired text data for biasing in neural ASR.
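The figures above are relative reductions, not absolute percentage-point drops. A minimal sketch of that arithmetic follows; the function name and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are WERs in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper):
# a baseline at 10.0% WER and an improved system at 5.67% WER.
reduction = relative_wer_reduction(10.0, 5.67)
print(f"{reduction:.1f}% relative WER reduction")
```

With these illustrative inputs, the absolute drop is 4.33 points while the relative reduction is about 43.3%.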
Abstract
Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and its biasing component. Unpaired text is converted into speech-like representations and used to guide the model's attention towards relevant bias phrases. Moreover, we introduce a contextual text-injected (CTI) minimum word error rate (MWER) training, which minimizes the expected WER caused by contextual biasing when unpaired text is injected into the model. Experiments show that CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model.…
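The abstract describes MWER training as minimizing the expected WER. As a rough illustration, the sketch below computes the commonly used N-best approximation of that objective: hypothesis probabilities are renormalized over the N-best list and used to weight each hypothesis's word errors. This is an assumption about the general MWER formulation, not the paper's CTI-MWER implementation; the function names, the mean-error subtraction, and the example strings are all hypothetical.

```python
import math


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mwer_loss(reference, hypotheses, log_probs):
    """Expected word errors over an N-best list, relative to the mean.

    log_probs are the model's (unnormalized) log-probabilities for each
    hypothesis; they are renormalized over the N-best list via softmax.
    Subtracting the mean error is a common variance-reduction choice and
    is assumed here, not confirmed by the source.
    """
    ref = reference.split()
    errors = [edit_distance(ref, h.split()) for h in hypotheses]
    m = max(log_probs)
    weights = [math.exp(lp - m) for lp in log_probs]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean_err = sum(errors) / len(errors)
    return sum(p * (e - mean_err) for p, e in zip(probs, errors))


# Hypothetical example: the reference contains a rare name that a bias
# phrase list would be expected to help recognize.
nbest = ["call zelin wu", "call selling woo"]
loss = mwer_loss("call zelin wu", nbest, log_probs=[0.0, -2.0])
print(loss)
```

The loss decreases as probability mass shifts toward hypotheses with fewer word errors, which is the behavior a gradient-based trainer would exploit. In the paper's setting the N-best list would come from the model decoding injected unpaired text with biasing enabled, a detail this sketch does not model.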
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
