Improving RNN Transducer Based ASR with Auxiliary Tasks
Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig

TL;DR
This paper explores auxiliary tasks to enhance RNN transducer-based end-to-end speech recognition, demonstrating consistent improvements across multiple languages and achieving competitive results on LibriSpeech benchmarks.
Contribution
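The paper introduces two auxiliary tasks for RNN-T models: (i) reusing the primary RNN-T ASR objective as an auxiliary task, and (ii) context-dependent graphemic state prediction, as in conventional hybrid modeling. Both improve accuracy and help the model learn deep transformer encoders, with benefits shown on multilingual social media data and standard benchmarks.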
Findings
Both auxiliary tasks improve ASR accuracy across languages.
Auxiliary tasks help RNN-T models learn better deep transformer encoders.
Achieved 2.0%/4.2% WER on the LibriSpeech test-clean/test-other sets, competitive with top models.
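For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition; the function name and the whitespace tokenization are assumptions made here, not the scoring pipeline used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: real scoring tools also normalize text
    (case, punctuation) before comparing, which is skipped here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A reported WER of 2.0% therefore corresponds to roughly two word errors (substitutions, deletions, or insertions) per hundred reference words.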
Abstract
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers. Specifically, recurrent neural network transducer (RNN-T) has shown competitive ASR performance on various benchmarks. In this work, we examine ways in which RNN-T can achieve better ASR accuracy via performing auxiliary tasks. We propose (i) using the same auxiliary task as primary RNN-T ASR task, and (ii) performing context-dependent graphemic state prediction as in conventional hybrid modeling. In transcribing social media videos with varying training data size, we first evaluate the streaming ASR performance on three languages: Romanian, Turkish and German. We find that both proposed methods provide consistent improvements. Next, we observe that both auxiliary tasks demonstrate efficacy in learning…
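Training with auxiliary tasks is commonly set up as a weighted sum of the primary loss and the auxiliary losses. The sketch below shows only that generic pattern under stated assumptions: the function name, the weighted-sum form, and the example weights are hypothetical, since the truncated abstract above does not say how the paper combines its objectives.

```python
from typing import Sequence


def multitask_loss(
    primary_loss: float,
    aux_losses: Sequence[float],
    aux_weights: Sequence[float],
) -> float:
    """Combine a primary loss with weighted auxiliary losses.

    Assumption: a plain weighted sum. In this setting, primary_loss
    would be the RNN-T ASR loss, and aux_losses could hold an auxiliary
    RNN-T loss and a graphemic state prediction loss.
    """
    if len(aux_losses) != len(aux_weights):
        raise ValueError("need exactly one weight per auxiliary loss")
    weighted_aux = sum(w * loss for w, loss in zip(aux_weights, aux_losses))
    return primary_loss + weighted_aux


# Hypothetical values, chosen only to show the arithmetic.
total = multitask_loss(2.0, aux_losses=[1.0, 4.0], aux_weights=[0.5, 0.25])
```

With these made-up numbers the total is 2.0 + 0.5 * 1.0 + 0.25 * 4.0 = 3.5. In a real training loop the same sum would be taken over differentiable loss tensors, and the auxiliary branches are typically dropped at inference time.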
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
