Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models

Samuel Thomas; Brian Kingsbury; George Saon; Hong-Kwang J. Kuo

arXiv:2202.13155·cs.CL·March 1, 2022

Open Access

TL;DR

This paper introduces a new text-based training method for RNN Transducer ASR models that enables effective domain adaptation using only text data, significantly reducing word error rates across multiple datasets.

Contribution

It presents a novel text representation and training framework that allows RNN-T models to adapt their internal language model component with only text data, improving flexibility and customization.

Findings

01

Close to 13% WER reduction on the Switchboard and CallHome test sets (NIST Hub5 2000) over a baseline trained on speech only

02

20-45% relative WER reduction in domain-specific adaptations

03

Effective domain adaptation using only unpaired text data

Abstract

Compared to hybrid automatic speech recognition (ASR) systems that use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR systems are harder to customize due to their all-neural monolithic construction. In this paper, we propose a novel text representation and training framework for E2E ASR models. With this approach, we show that a trained RNN Transducer (RNN-T) model's internal LM component can be effectively adapted with text-only data. An RNN-T model trained using both speech and text inputs improves over a baseline model trained on just speech with close to 13% word error rate (WER) reduction on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation. The usefulness of the proposed approach is further demonstrated by customizing this general purpose RNN-T model to three separate datasets. We…
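For readers unfamiliar with the metric behind these figures, the following is a minimal sketch of how word error rate and a relative WER reduction are commonly computed. It is not the authors' scoring pipeline: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, the text normalization is a plain whitespace split, and the numbers in the usage lines are illustrative, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are obtained by a plain whitespace split; real ASR
    scoring usually applies additional text normalization first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that was removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (not taken from the paper).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
reduction = relative_wer_reduction(baseline_wer=10.0, new_wer=8.0)
```

In the first usage line the hypothesis has one substituted word and one missing word against a six-word reference, so the WER is 2/6. A relative reduction divides the WER improvement by the baseline WER, which is why it is larger than the absolute difference in percentage points whenever the baseline WER is below 100%.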

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing