Exploring RNN-Transducer for Chinese Speech Recognition
Senmao Wang, Pan Zhou, Wei Chen, Jia Jia, Lei Xie

TL;DR
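This paper investigates the RNN-Transducer (RNN-T) for Chinese large vocabulary continuous speech recognition. It proposes a new learning rate decay strategy and adds convolutional layers at the beginning of the network, simplifying training while reaching a 16.9% character error rate (CER) and surpassing previous models.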
Contribution
The study introduces training strategies for RNN-T, namely a new learning rate decay strategy and convolutional layers at the beginning of the network, that simplify the training process while maintaining performance.
Findings
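Achieved a 16.9% character error rate (CER) on a Chinese LVCSR task
Proposed a new learning rate decay strategy to accelerate model convergence
Added convolutional layers at the beginning of the network and used ordered data, removing the need to pre-train the encoder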
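For readers unfamiliar with the metric, CER is conventionally computed as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below illustrates that standard definition only; the function names and the example sentence are hypothetical and are not taken from the paper or its evaluation scripts.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted character in a six-character reference.
example_cer = cer("今天天气很好", "今天天汽很好")
```

In this example a single substitution in a six-character reference gives a CER of 1/6, or roughly 16.7%.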
Abstract
End-to-end approaches have drawn much attention recently for significantly simplifying the construction of an automatic speech recognition (ASR) system. RNN transducer (RNN-T) is one of the popular end-to-end methods. Previous studies have shown that RNN-T is difficult to train and a very complex training process is needed for a reasonable performance. In this paper, we explore RNN-T for a Chinese large vocabulary continuous speech recognition (LVCSR) task and aim to simplify the training process while maintaining performance. First, a new strategy of learning rate decay is proposed to accelerate the model convergence. Second, we find that adding convolutional layers at the beginning of the network and using ordered data can discard the pre-training process of the encoder without loss of performance. Besides, we design experiments to find a balance among the usage of GPU memory,…
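The abstract shown here does not describe the proposed decay strategy in detail. As a generic illustration of what learning rate decay means, and explicitly not the authors' schedule, the sketch below applies a simple per-epoch exponential decay. The function name, initial rate, and decay factor are all assumptions chosen for the example.

```python
def decayed_lr(initial_lr: float, decay_rate: float, epoch: int) -> float:
    """Generic exponential decay: multiply the rate by decay_rate once per epoch.

    This is an illustrative schedule, not the strategy proposed in the paper.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial_lr * decay_rate ** epoch


# Hypothetical settings: start at 1e-3 and halve the rate every epoch.
schedule = [decayed_lr(1e-3, 0.5, epoch) for epoch in range(4)]
```

With these assumed settings the rate halves each epoch, so later updates take smaller steps than earlier ones.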
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
