Improving RNN transducer with normalized jointer network
Mingkun Huang, Jun Zhang, Meng Cai, Yang Zhang, Jiali Yao, Yongbin You, Yi He, Zejun Ma

TL;DR
This paper introduces a normalized jointer network to reduce gradient variance in RNN transducer training, combined with enhanced encoder and predictor networks, leading to state-of-the-art speech recognition results on multiple datasets.
Contribution
The paper proposes a novel normalized jointer network to address gradient variance issues in RNN-T training and combines it with improved network architectures for better performance.
Findings
Achieved state-of-the-art CER on the AISHELL-1 benchmark.
Reduced gradient variance in RNN-T training.
Improved recognition accuracy on large-scale industrial data.
Abstract
Recurrent neural transducer (RNN-T) is a promising end-to-end (E2E) model in automatic speech recognition (ASR). It has shown superior performance compared to traditional hybrid ASR systems. However, training RNN-T from scratch is still challenging. We observe a huge gradient variance during RNN-T training and suspect it hurts the performance. In this work, we analyze the cause of the huge gradient variance in RNN-T training and propose a new normalized jointer network to overcome it. We also propose to enhance the RNN-T network with a modified conformer encoder network and transformer-XL predictor networks to achieve the best performance. Experiments are conducted on the open 170-hour AISHELL-1 dataset and an industrial-level 30000-hour Mandarin speech dataset. On the AISHELL-1 dataset, our RNN-T system gets state-of-the-art results on AISHELL-1's streaming and non-streaming benchmark…
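The abstract does not spell out where the gradient variance comes from or how the normalization is done, so the following is only an illustrative sketch under stated assumptions. It assumes the common additive joint, joint[t][u] = enc[t] + pred[u], in which each encoder frame is shared by all U label positions and each predictor state by all T frames, so back-propagated gradients are summed over U and T respectively. The function names and the divide-by-U / divide-by-T rescaling are hypothetical choices for illustration, not necessarily the paper's normalized jointer network.

```python
def joint_input_grads(upstream):
    """Back-propagate through an additive joint: joint[t][u] = enc[t] + pred[u].

    upstream[t][u] is a list of D floats, the gradient arriving at joint[t][u].
    Returns (grad_enc, grad_pred) with shapes (T, D) and (U, D).
    """
    T, U, D = len(upstream), len(upstream[0]), len(upstream[0][0])
    # enc[t] feeds U joint cells, so its gradient is a sum over u.
    grad_enc = [[sum(upstream[t][u][d] for u in range(U)) for d in range(D)]
                for t in range(T)]
    # pred[u] feeds T joint cells, so its gradient is a sum over t.
    grad_pred = [[sum(upstream[t][u][d] for t in range(T)) for d in range(D)]
                 for u in range(U)]
    return grad_enc, grad_pred


def rescaled_joint_input_grads(upstream):
    """Assumed normalization: average instead of sum over the shared axis."""
    T, U = len(upstream), len(upstream[0])
    grad_enc, grad_pred = joint_input_grads(upstream)
    grad_enc = [[g / U for g in row] for row in grad_enc]
    grad_pred = [[g / T for g in row] for row in grad_pred]
    return grad_enc, grad_pred


# Toy case: T=4 frames, U=3 label positions, D=2 features, all-ones upstream.
upstream = [[[1.0, 1.0] for _ in range(3)] for _ in range(4)]
grad_enc, grad_pred = joint_input_grads(upstream)
norm_enc, norm_pred = rescaled_joint_input_grads(upstream)
```

In the toy case the unnormalized gradient magnitudes grow with the utterance's U and T, while the rescaled ones do not, which is one way such length-dependent scale differences across utterances could be removed.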
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adaptive Input Representations · Linear Warmup With Cosine Annealing · Transformer-XL · Cosine Annealing · Residual Connection · Adam
