Improving RNN Transducer Based ASR with Auxiliary Tasks

Chunxi Liu; Frank Zhang; Duc Le; Suyoun Kim; Yatharth Saraf; Geoffrey Zweig

arXiv:2011.03109 · cs.CL · November 10, 2020 · 1 citation

TL;DR

This paper explores auxiliary tasks to enhance RNN transducer-based end-to-end speech recognition, demonstrating consistent improvements across multiple languages and achieving competitive results on LibriSpeech benchmarks.

Contribution

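It introduces two auxiliary tasks for RNN-T models: (i) an auxiliary task that is the same as the primary RNN-T ASR task, and (ii) context-dependent graphemic state prediction, as used in conventional hybrid modeling. Both improve recognition accuracy and help the model learn deep transformer encoders, with benefits shown on social media video data in Romanian, Turkish and German and on the LibriSpeech benchmark.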

Findings

1. Both auxiliary tasks improve streaming ASR accuracy consistently across the three evaluated languages (Romanian, Turkish and German).
2. Auxiliary tasks help RNN-T models learn better deep transformer encoders.
3. The approach achieves 2.0%/4.2% WER on the LibriSpeech test-clean/test-other sets, competitive with top models.
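For readers unfamiliar with the metric in the last finding, word error rate (WER) is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition; the function name `word_error_rate` is chosen here for illustration, and the code is not the scoring tool used in the paper, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

A WER of 2.0% therefore corresponds to roughly two word errors (substitutions, deletions or insertions) per hundred reference words.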

Abstract

End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers. Specifically, recurrent neural network transducer (RNN-T) has shown competitive ASR performance on various benchmarks. In this work, we examine ways in which RNN-T can achieve better ASR accuracy via performing auxiliary tasks. We propose (i) using the same auxiliary task as primary RNN-T ASR task, and (ii) performing context-dependent graphemic state prediction as in conventional hybrid modeling. In transcribing social media videos with varying training data size, we first evaluate the streaming ASR performance on three languages: Romanian, Turkish and German. We find that both proposed methods provide consistent improvements. Next, we observe that both auxiliary tasks demonstrate efficacy in learning…
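The abstract excerpt above does not state how the auxiliary objectives enter training. A common multi-task setup adds weighted auxiliary losses to the primary loss, and the sketch below assumes that form purely for illustration. The function name `combined_loss`, the weights and the example loss values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def combined_loss(
    primary_loss: float,
    aux_losses: Sequence[float],
    aux_weights: Sequence[float],
) -> float:
    """Primary loss plus a weighted sum of auxiliary losses (assumed form)."""
    if len(aux_losses) != len(aux_weights):
        raise ValueError("each auxiliary loss needs exactly one weight")
    weighted_aux = sum(w * loss for w, loss in zip(aux_weights, aux_losses))
    return primary_loss + weighted_aux


if __name__ == "__main__":
    # Hypothetical values: a primary RNN-T loss of 2.0 and two auxiliary
    # losses (for example, an auxiliary RNN-T loss and a state-prediction
    # cross-entropy loss) weighted by 0.5 and 0.25.
    print(combined_loss(2.0, [1.0, 4.0], [0.5, 0.25]))  # 3.5
```

In a setup like this the auxiliary terms only shape training; at inference time only the primary ASR output would be used.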

Code & Models

Repositories

upskyy/Transformer-Transducer (PyTorch)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing