Improving RNN transducer with normalized jointer network

Mingkun Huang; Jun Zhang; Meng Cai; Yang Zhang; Jiali Yao; Yongbin You; Yi He; Zejun Ma

arXiv:2011.01576 · eess.AS · November 4, 2020 · 5 cites

TL;DR

This paper introduces a normalized jointer network to reduce gradient variance in RNN transducer training, combined with enhanced encoder and predictor networks, leading to state-of-the-art speech recognition results on multiple datasets.

Contribution

The paper proposes a novel normalized jointer network to address gradient variance issues in RNN-T training and combines it with improved network architectures for better performance.

Findings

1. Achieved state-of-the-art CER on AISHELL-1 benchmark.
2. Reduced gradient variance in RNN-T training.
3. Improved recognition accuracy on large-scale industrial data.

Abstract

Recurrent neural transducer (RNN-T) is a promising end-to-end (E2E) model in automatic speech recognition (ASR). It has shown superior performance compared to traditional hybrid ASR systems. However, training RNN-T from scratch is still challenging. We observe a huge gradient variance during RNN-T training and suspect it hurts the performance. In this work, we analyze the cause of the huge gradient variance in RNN-T training and propose a new normalized jointer network to overcome it. We also propose to enhance the RNN-T network with a modified conformer encoder network and transformer-XL predictor networks to achieve the best performance. Experiments are conducted on the open 170-hour AISHELL-1 dataset and an industrial-level 30000-hour Mandarin speech dataset. On the AISHELL-1 dataset, our RNN-T system gets state-of-the-art results on AISHELL-1's streaming and non-streaming benchmark…
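
The abstract does not spell out how the normalization works, so the following is only a minimal sketch under stated assumptions. It shows a common additive RNN-T joint, where each encoder frame is broadcast against every label position, and why the gradient reaching a frame is a sum over all U label positions (and the gradient reaching a label position is a sum over all T frames). The function names `joint_forward` and `joint_backward`, and the 1/U and 1/T rescaling behind the `normalize` flag, are hypothetical illustrations and are not confirmed by this page as the paper's exact method.

```python
import numpy as np


def joint_forward(f, g):
    """Additive joint: z[t, u] = f[t] + g[u], computed by broadcasting.

    f: (T, D) encoder outputs; g: (U, D) predictor outputs.
    Returns z with shape (T, U, D).
    """
    return f[:, None, :] + g[None, :, :]


def joint_backward(dz, normalize=False):
    """Backpropagate an upstream gradient dz of shape (T, U, D).

    Without normalization, df[t] sums over all U label positions and
    dg[u] sums over all T frames, so their scale grows with U and T.
    With normalize=True the sums become means. This rescaling is an
    illustrative assumption, not necessarily the paper's formulation.
    """
    T, U, _ = dz.shape
    df = dz.sum(axis=1)  # gradient w.r.t. f, shape (T, D)
    dg = dz.sum(axis=0)  # gradient w.r.t. g, shape (U, D)
    if normalize:
        df = df / U
        dg = dg / T
    return df, dg


# Toy sizes: 50 acoustic frames, 10 label positions, 4 hidden units.
T, U, D = 50, 10, 4
f = np.zeros((T, D))
g = np.zeros((U, D))
z = joint_forward(f, g)

# A constant upstream gradient makes the accumulation easy to read off.
dz = np.ones_like(z)
df, dg = joint_backward(dz)
df_n, dg_n = joint_backward(dz, normalize=True)
```

With a constant upstream gradient of 1, the unnormalized encoder-side gradient equals U and the predictor-side gradient equals T in every entry, while the rescaled versions stay at 1 regardless of sequence length. Because T and U vary from utterance to utterance, this kind of length dependence is one plausible source of the gradient variance the abstract refers to.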

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adaptive Input Representations · Linear Warmup With Cosine Annealing · Transformer-XL · Cosine Annealing · Residual Connection · Adam