Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU
Daniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey

TL;DR
This paper introduces an exact, GPU-based greedy decoding method for RNN-T speech recognition models. It eliminates the GPU idle time of existing decoders (~80% of decoding time) and speeds up a 1.1 billion parameter RNN-T model end-to-end by 2.5x.
Contribution
The authors present an exact implementation of RNN-T greedy decoding that runs on the GPU using CUDA graph conditional nodes, a feature introduced in CUDA 12.4. This removes the GPU idle time that dominates current decoding implementations.
Findings
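End-to-end speedup of 2.5x for a 1.1 billion parameter RNN-T model
A 1.1 billion parameter RNN-T model runs only 16% slower than a similarly sized CTC model
The technique also applies to "label looping" greedy decoding, giving 1.7x and 1.4x end-to-end speedups for 1.1 billion parameter RNN-T and Token and Duration Transducer models respectively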
Abstract
The vast majority of inference time for RNN Transducer (RNN-T) models today is spent on decoding. Current state-of-the-art RNN-T decoding implementations leave the GPU idle ~80% of the time. Leveraging a new CUDA 12.4 feature, CUDA graph conditional nodes, we present an exact GPU-based implementation of greedy decoding for RNN-T models that eliminates this idle time. Our optimizations speed up a 1.1 billion parameter RNN-T model end-to-end by a factor of 2.5x. This technique can be applied to the "label looping" alternative greedy decoding algorithm as well, achieving 1.7x and 1.4x end-to-end speedups when applied to 1.1 billion parameter RNN-T and Token and Duration Transducer models respectively. This work enables a 1.1 billion parameter RNN-T model to run only 16% slower than a similarly sized CTC model, contradicting the common belief that RNN-T models are not suitable for high…
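The abstract does not spell out the decoding loop itself, so the following is a minimal sketch of the standard frame-by-frame RNN-T greedy decoding algorithm, not the authors' implementation. The names `greedy_decode`, `toy_predictor_step`, `toy_joint`, `BLANK`, and `max_symbols_per_frame` are hypothetical, and the toy predictor and joint functions stand in for real neural networks. The point of the sketch is the data-dependent control flow: the inner loop must inspect each predicted token before it knows whether to emit again or move to the next frame. When that decision is made on the host, the GPU can end up waiting between small kernels, which is plausibly the source of the idle time the abstract describes. CUDA graph conditional nodes allow such loops and branches to be evaluated on the device.

```python
from typing import Any, Callable, List, Tuple

BLANK = 0  # assumed index of the blank symbol (hypothetical convention)


def greedy_decode(
    encoder_frames: List[List[float]],
    predictor_step: Callable[[int, Any], Tuple[List[float], Any]],
    joint: Callable[[List[float], List[float]], List[float]],
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Standard RNN-T greedy decoding over one utterance."""
    hypothesis: List[int] = []
    # Prime the prediction network with the blank/start symbol.
    pred_out, state = predictor_step(BLANK, None)
    for frame in encoder_frames:  # outer loop over time steps
        emitted = 0
        while emitted < max_symbols_per_frame:  # inner loop over labels
            logits = joint(frame, pred_out)
            token = max(range(len(logits)), key=logits.__getitem__)
            if token == BLANK:
                break  # data-dependent branch: advance to the next frame
            hypothesis.append(token)
            # Only non-blank tokens update the prediction network.
            pred_out, state = predictor_step(token, state)
            emitted += 1
    return hypothesis


def toy_predictor_step(token: int, state: Any) -> Tuple[List[float], Any]:
    """Toy stand-in: penalize repeating the token that was just emitted."""
    out = [0.0, 0.0, 0.0, 0.0]
    if token != BLANK:
        out[token] = -10.0
    return out, token


def toy_joint(frame: List[float], pred: List[float]) -> List[float]:
    """Toy stand-in: add encoder and predictor scores elementwise."""
    return [f + p for f, p in zip(frame, pred)]


frames = [
    [0.1, 2.0, 0.0, 0.0],  # emits token 1, then blank
    [0.5, 0.0, 0.0, 0.0],  # blank only
    [0.1, 0.0, 3.0, 0.0],  # emits token 2, then blank
]
print(greedy_decode(frames, toy_predictor_step, toy_joint))  # [1, 2]
```

In a real system `joint` and `predictor_step` would each launch GPU kernels, and the `if token == BLANK` test is the step that forces a host-device round trip when the loop is driven from the CPU.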
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
