Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models
Samuel Thomas, Brian Kingsbury, George Saon, Hong-Kwang J. Kuo

TL;DR
This paper introduces a new text-based training method for RNN Transducer ASR models that enables effective domain adaptation using only text data, significantly reducing word error rates across multiple datasets.
Contribution
It presents a novel text representation and training framework that allows RNN-T models to adapt their internal language model component with only text data, improving flexibility and customization.
Findings
Close to 13% WER reduction over a speech-only baseline on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation
20-45% relative WER reduction in domain-specific adaptations
Effective domain adaptation using only unpaired text data
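
The percentages above are easier to interpret with the arithmetic spelled out. The following is a minimal sketch of how a relative WER reduction is conventionally computed; the function name and the example WER values are hypothetical illustrations and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper):
# a baseline at 10.0% WER and an adapted model at 8.0% WER.
baseline = 10.0
adapted = 8.0
print(f"{relative_wer_reduction(baseline, adapted):.1f}% relative reduction")
```

Under this convention, a 20% relative reduction from a 10.0% baseline corresponds to an absolute drop of 2.0 WER points, so the same relative figure implies different absolute gains depending on the baseline.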
Abstract
Compared to hybrid automatic speech recognition (ASR) systems that use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR systems are harder to customize due to their all-neural monolithic construction. In this paper, we propose a novel text representation and training framework for E2E ASR models. With this approach, we show that a trained RNN Transducer (RNN-T) model's internal LM component can be effectively adapted with text-only data. An RNN-T model trained using both speech and text inputs improves over a baseline model trained on just speech with close to 13% word error rate (WER) reduction on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation. The usefulness of the proposed approach is further demonstrated by customizing this general purpose RNN-T model to three separate datasets. We…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
