Improving the fusion of acoustic and text representations in RNN-T

Chao Zhang; Bo Li; Zhiyun Lu; Tara N. Sainath; Shuo-yiin Chang

arXiv:2201.10240·eess.AS·January 26, 2022

Open Access

TL;DR

This paper improves the RNN-T for streaming speech recognition by adding gating and bilinear pooling to the joint network, together with a regularization method for early training, yielding 4-5% relative WER reductions across nine languages.

Contribution

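It proposes gating, bilinear pooling, and a combination of the two in the RNN-T joint network, plus a regularization technique that reduces the gradients back-propagated into the prediction network at the beginning of training so that the acoustic encoder trains better.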

Findings

01

Achieved 4-5% relative WER reduction across nine languages.

02

Enhanced the expressiveness of the joint network with minimal additional parameters.

03

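Proposed a regularization method that enables better acoustic encoder training by reducing the gradients back-propagated into the prediction network at the beginning of RNN-T training.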
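
The first finding is stated as a relative reduction, which is easy to confuse with an absolute one. A minimal sketch of the arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the WER drops, measured against the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, not taken from the paper: a baseline WER of 10.0%
# and a new WER of 9.5% differ by 0.5 points absolute.
baseline_wer = 10.0
new_wer = 9.5
absolute_reduction = baseline_wer - new_wer
relative_reduction = relative_wer_reduction(baseline_wer, new_wer)

print(f"absolute reduction: {absolute_reduction:.1f} points")
print(f"relative reduction: {relative_reduction:.1%}")
```

With these illustrative values, a half-point absolute drop corresponds to a 5% relative reduction, the same order as the 4-5% reported above.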

Abstract

The recurrent neural network transducer (RNN-T) has recently become the mainstream end-to-end approach for streaming automatic speech recognition (ASR). To estimate the output distributions over subword units, RNN-T uses a fully connected layer as the joint network to fuse the acoustic representations extracted using the acoustic encoder with the text representations obtained using the prediction network based on the previous subword units. In this paper, we propose to use gating, bilinear pooling, and a combination of them in the joint network to produce more expressive representations to feed into the output layer. A regularisation method is also proposed to enable better acoustic encoder training by reducing the gradients back-propagated into the prediction network at the beginning of RNN-T training. Experimental results on a multilingual ASR setting for voice search over nine…
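
To make the fusion options named in the abstract concrete, here is a minimal NumPy sketch of a baseline joint network, a gated variant, and a low-rank form of bilinear pooling. It is an illustration under stated assumptions, not the authors' implementation: the layer sizes, weight names, the tanh activations, the convex-combination form of the gate, and the Hadamard-product approximation of bilinear pooling are all assumed here, and the paper's exact formulations may differ.

```python
import numpy as np

# Hypothetical sizes: acoustic vector, text vector, and fused vector.
D_ENC, D_PRED, D_JOINT = 8, 6, 5

rng = np.random.default_rng(0)
# Randomly initialised projection matrices (all names are assumptions).
W_enc = 0.1 * rng.normal(size=(D_JOINT, D_ENC))
W_pred = 0.1 * rng.normal(size=(D_JOINT, D_PRED))
W_gate_enc = 0.1 * rng.normal(size=(D_JOINT, D_ENC))
W_gate_pred = 0.1 * rng.normal(size=(D_JOINT, D_PRED))
W_out = 0.1 * rng.normal(size=(D_JOINT, D_JOINT))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def joint_additive(f, g):
    """Baseline: one fully connected layer over both inputs, i.e. the sum
    of two linear projections followed by a nonlinearity."""
    return np.tanh(W_enc @ f + W_pred @ g)


def joint_gated(f, g):
    """Gating (assumed form): a sigmoid gate computed from both inputs
    decides, per dimension, how much of each representation to keep."""
    gate = sigmoid(W_gate_enc @ f + W_gate_pred @ g)
    return gate * np.tanh(W_enc @ f) + (1.0 - gate) * np.tanh(W_pred @ g)


def joint_bilinear(f, g):
    """Low-rank bilinear pooling (assumed form): the elementwise product
    of the two projected vectors, followed by a linear projection, gives
    multiplicative interactions without a full bilinear tensor."""
    return W_out @ (np.tanh(W_enc @ f) * np.tanh(W_pred @ g))


# One acoustic frame representation and one text representation.
f_t = rng.normal(size=D_ENC)
g_u = rng.normal(size=D_PRED)

h_add = joint_additive(f_t, g_u)
h_gate = joint_gated(f_t, g_u)
h_bilinear = joint_bilinear(f_t, g_u)
print(h_add.shape, h_gate.shape, h_bilinear.shape)
```

In each variant the fused vector would then be passed to the output layer to estimate the distribution over subword units. The additive baseline can only sum the two projections, whereas the gated and bilinear variants let one representation modulate the other multiplicatively, which is one way to read "more expressive" in the abstract.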

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing