Deep Shallow Fusion for RNN-T Personalization
Duc Le, Gil Keren, Julian Chan, Jay Mahadeokar, Christian Fuegen, Michael L. Seltzer

TL;DR
This paper introduces techniques, including deep fusion with personalized language models, that make RNN-T speech recognition models easier to personalize, especially for rare words and entity names. Combined, they yield a 15.4%-34.5% relative Word Error Rate (WER) improvement over the baseline.
Contribution
The work presents new methods for RNN-T personalization: better modeling of rare WordPieces, infusing extra information into the encoder, supporting alternative graphemic pronunciations, and deep fusion with personalized language models for more robust biasing.
Findings
Achieved a 15.4%-34.5% relative WER reduction over the baseline.
Enhanced recognition of rare words and entities.
Helps close the gap with hybrid systems on biasing tasks.
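For readers unfamiliar with the metric, a relative WER reduction is the drop in WER expressed as a percentage of the baseline WER, not an absolute difference in percentage points. The sketch below shows the arithmetic; the function name and the input values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer vs. baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (NOT results from the paper):
# a baseline WER of 20.0% lowered to 15.0% is a 5-point absolute drop,
# which corresponds to a 25% relative reduction.
print(relative_wer_reduction(20.0, 15.0))  # 25.0
```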
Abstract
End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks. However, these models are more challenging to personalize compared to traditional hybrid systems due to the lack of external language models and difficulties in recognizing rare long-tail words, specifically entity names. In this work, we present novel techniques to improve RNN-T's ability to model rare WordPieces, infuse extra information into the encoder, enable the use of alternative graphemic pronunciations, and perform deep fusion with personalized language models for more robust biasing. We show that these combined techniques result in 15.4%-34.5% relative Word Error Rate improvement compared to…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
