U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition
Di Wu, Binbin Zhang, Chao Yang, Zhendong Peng, Wenjing Xia, Xiaoyu Chen, Xin Lei

TL;DR
U2++ is an improved speech recognition model that leverages bidirectional training and decoding, along with a new data augmentation method, achieving state-of-the-art accuracy and robustness in streaming and non-streaming settings.
Contribution
The paper introduces U2++, a novel bidirectional end-to-end speech recognition model that enhances U2 with richer training information and a new data augmentation technique.
Findings
Achieves 4.63% CER non-streaming on AISHELL-1
Achieves 5.05% CER streaming with 320ms latency on AISHELL-1
Delivers a consistent 5-8% word error rate (WER) reduction over U2
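For readers unfamiliar with the metric behind these figures, the following is a minimal sketch of character error rate (CER) as it is conventionally defined: the character-level edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are illustrative and do not come from the paper or its toolkit.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER of 4.63% corresponds to roughly 4.6 character errors per 100 reference characters. WER is computed the same way over word tokens instead of characters.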
Abstract
The unified streaming and non-streaming two-pass (U2) end-to-end model for speech recognition has shown great performance in terms of streaming capability, accuracy, real-time factor (RTF), and latency. In this paper, we present U2++, an enhanced version of U2 that further improves accuracy. The core idea of U2++ is to use the forward and backward information of the label sequences simultaneously during training to learn richer information, and to combine the forward and backward predictions at decoding to give more accurate recognition results. We also propose a new data augmentation method called SpecSub to help the U2++ model be more accurate and robust. Our experiments show that, compared with U2, U2++ converges faster in training, is more robust to the choice of decoding method, and delivers a consistent 5%-8% word error rate reduction over U2. On the experiment of…
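The idea of combining forward and backward predictions at decoding can be illustrated with a small rescoring sketch. It assumes that each first-pass hypothesis already carries three log-scores: one from the first pass, one from a left-to-right decoder, and one from a right-to-left decoder evaluated on the reversed token sequence. The `Hypothesis` class, the linear interpolation, and the `ctc_weight` and `reverse_weight` parameters are assumptions made for illustration; the abstract does not specify the exact combination rule.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hypothesis:
    tokens: List[str]
    ctc_score: float  # first-pass log-score (assumed to be available)
    l2r_score: float  # log-prob of tokens under a left-to-right decoder
    r2l_score: float  # log-prob of reversed tokens under a right-to-left decoder


def reversed_target(tokens: Sequence[str]) -> List[str]:
    """Label sequence the right-to-left decoder would see for the same utterance."""
    return list(reversed(tokens))


def combined_score(h: Hypothesis,
                   ctc_weight: float = 0.5,
                   reverse_weight: float = 0.3) -> float:
    """Interpolate the two decoder scores, then add the weighted first-pass score.

    The weights are hypothetical defaults chosen for this sketch.
    """
    decoder = (1.0 - reverse_weight) * h.l2r_score + reverse_weight * h.r2l_score
    return decoder + ctc_weight * h.ctc_score


def rescore(nbest: Sequence[Hypothesis],
            ctc_weight: float = 0.5,
            reverse_weight: float = 0.3) -> Hypothesis:
    """Return the hypothesis with the highest combined score."""
    return max(nbest, key=lambda h: combined_score(h, ctc_weight, reverse_weight))
```

Setting `reverse_weight=0.0` in this sketch reduces it to rescoring with the forward decoder only, which makes it easy to compare the two settings on the same n-best list.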
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
