Multi-blank Transducers for Speech Recognition

Hainan Xu; Fei Jia; Somshubra Majumdar; Shinji Watanabe; Boris Ginsburg

arXiv:2211.03541·eess.AS·April 15, 2024

TL;DR

This paper introduces multi-blank RNN-T models for speech recognition. These models add "big blank" symbols that each consume two or more input frames when emitted, which speeds up inference and improves accuracy across multiple languages and datasets.

Contribution

The paper proposes two things for RNN-T models: additional big blank symbols that consume two or more input frames per emission, and a logit under-normalization training method that prioritizes emitting them.

Findings

01 Relative inference speedup of over 90% on English Librispeech

02 Relative inference speedup of over 139% on German Multilingual Librispeech

03 Consistent improvement in ASR accuracy across datasets

Abstract

This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo…
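
The frame-skipping idea in the abstract can be illustrated with a toy greedy decoder. This is a minimal sketch under stated assumptions, not the authors' NeMo implementation: the big-blank durations (2 and 4), the symbol names, and the scripted `make_toy_predictor` helper are hypothetical, and a real model would pick each symbol from the joint network's output rather than from a script.

```python
from typing import Callable, Dict, List, Tuple

BLANK = "<b>"  # standard blank: consumes exactly one frame
# Hypothetical big blanks; the durations are illustrative assumptions.
BLANK_DURATIONS: Dict[str, int] = {BLANK: 1, "<b2>": 2, "<b4>": 4}

Predictor = Callable[[int, List[str]], str]


def greedy_decode(num_frames: int, predict: Predictor,
                  max_symbols_per_frame: int = 10) -> Tuple[List[str], int]:
    """Greedy transducer decoding; returns (hypothesis, number of predict calls)."""
    t, steps, hyp = 0, 0, []
    while t < num_frames:
        emitted = 0
        while emitted < max_symbols_per_frame:
            sym = predict(t, hyp)
            steps += 1
            if sym in BLANK_DURATIONS:
                # Any blank advances time; a big blank skips several frames.
                t += BLANK_DURATIONS[sym]
                break
            hyp.append(sym)  # non-blank: emit the token, stay on the same frame
            emitted += 1
        else:
            t += 1  # emission cap reached without a blank: force one frame forward
    return hyp, steps


def make_toy_predictor(token_at: Dict[int, str], num_frames: int,
                       use_big_blanks: bool) -> Predictor:
    """Scripted stand-in for a model: emits token_at[t] at frame t, else a blank."""
    frames = sorted(token_at)

    def predict(t: int, hyp: List[str]) -> str:
        k = len(hyp)  # number of tokens emitted so far
        if k < len(frames) and frames[k] == t:
            return token_at[t]
        next_frame = frames[k] if k < len(frames) else num_frames
        gap = next_frame - t
        if not use_big_blanks:
            return BLANK
        # Pick the largest blank that does not skip past the next token.
        best = max(d for d in BLANK_DURATIONS.values() if d <= gap)
        return next(s for s, d in BLANK_DURATIONS.items() if d == best)

    return predict


script = {0: "a", 4: "b"}
standard = greedy_decode(8, make_toy_predictor(script, 8, use_big_blanks=False))
multi = greedy_decode(8, make_toy_predictor(script, 8, use_big_blanks=True))
print(standard, multi)
```

In this toy setup both decoders produce the same hypothesis, but the multi-blank variant makes fewer calls to the predictor because each big blank jumps over several frames at once. That reduction in decoding steps is the mechanism behind the reported inference speedups; the actual gain depends on how often a trained model emits big blanks.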

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing