Multi-blank Transducers for Speech Recognition
Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg

TL;DR
This paper introduces multi-blank RNN-T models for speech recognition. In addition to the standard blank, these models can emit "big blank" symbols that each consume two or more input frames, which speeds up inference and improves accuracy across multiple languages.
Contribution
It proposes a novel multi-blank emission method and a logit under-normalization training technique for RNN-T models, enhancing speed and accuracy.
Findings
Relative inference speedup of over 90% on English Librispeech
Relative inference speedup of over 139% on German Multilingual Librispeech
Consistent improvement in ASR accuracy across datasets
Abstract
This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo…
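The frame-skipping behaviour described in the abstract can be illustrated with a minimal greedy-decoding sketch. This is not the authors' or NeMo's implementation: the function names, the `predict` callback, the token ids, and the toy predictor are all hypothetical, chosen only to show how a blank with duration d advances the frame pointer by d instead of 1.

```python
from typing import Callable, Dict, List, Tuple


def greedy_multi_blank_decode(
    num_frames: int,
    predict: Callable[[int, List[int]], int],
    blank_durations: Dict[int, int],
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy transducer decoding in which blanks may consume several frames.

    `predict(t, hyp)` is a hypothetical stand-in for the argmax over the joint
    network output at frame t given the tokens emitted so far.
    `blank_durations` maps each blank token id to the number of frames it
    consumes: 1 for the standard blank, 2 or more for big blanks.
    """
    hyp: List[int] = []
    t = 0
    emitted_at_t = 0
    while t < num_frames:
        token = predict(t, hyp)
        if token in blank_durations:
            # A blank advances time by its duration; a big blank skips frames.
            t += blank_durations[token]
            emitted_at_t = 0
        else:
            hyp.append(token)
            emitted_at_t += 1
            # Safety valve (an assumption of this sketch) against endless
            # emission at a single frame.
            if emitted_at_t >= max_symbols_per_frame:
                t += 1
                emitted_at_t = 0
    return hyp


def make_toy_predictor(
    events: List[Tuple[int, int]],
    num_frames: int,
    blank_durations: Dict[int, int],
):
    """Toy predictor: emits scripted (frame, token) events and otherwise
    picks the longest blank that does not jump past the next event."""
    calls = [0]  # mutable counter of predictor invocations

    def predict(t: int, hyp: List[int]) -> int:
        calls[0] += 1
        k = len(hyp)  # index of the next scripted event
        if k < len(events) and events[k][0] == t:
            return events[k][1]
        next_frame = events[k][0] if k < len(events) else num_frames
        gap = next_frame - t
        _, token = max((d, tok) for tok, d in blank_durations.items() if d <= gap)
        return token

    return predict, calls


EVENTS = [(0, 10), (3, 11)]       # token 10 at frame 0, token 11 at frame 3
STANDARD = {0: 1}                 # only the standard blank
MULTI = {0: 1, 1: 2, 2: 4}        # plus big blanks spanning 2 and 4 frames


def run(blank_durations: Dict[int, int]) -> Tuple[List[int], int]:
    predict, calls = make_toy_predictor(EVENTS, 8, blank_durations)
    hyp = greedy_multi_blank_decode(8, predict, blank_durations)
    return hyp, calls[0]


if __name__ == "__main__":
    print("standard:", run(STANDARD))
    print("multi-blank:", run(MULTI))
```

In this toy setting both configurations decode the same token sequence, but the multi-blank configuration reaches the end of the input with fewer predictor calls. The toy says nothing about the size of the speedups reported in the paper, which depend on the trained model and the data.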
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
