Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

Daniel Galvez; Vladimir Bataev; Hainan Xu; Tim Kaldewey

arXiv:2406.03791·cs.LG·June 7, 2024

TL;DR

This paper introduces a GPU-based exact greedy decoding method for RNN-T speech recognition models that significantly reduces inference time and GPU idle periods, enabling high-throughput performance for large models.

Contribution

The authors present an exact GPU-based implementation of greedy decoding for RNN-T models, built on CUDA graph conditional nodes (a CUDA 12.4 feature), that eliminates the GPU idle time left by current decoding implementations.

Findings

01

End-to-end speedup of 2.5x for a 1.1 billion parameter RNN-T model

02

A 1.1 billion parameter RNN-T model runs only 16% slower than a similarly sized CTC model

03

Also applies to "label looping" greedy decoding, with end-to-end speedups of 1.7x (RNN-T) and 1.4x (Token and Duration Transducer) at 1.1 billion parameters

Abstract

The vast majority of inference time for RNN Transducer (RNN-T) models today is spent on decoding. Current state-of-the-art RNN-T decoding implementations leave the GPU idle ~80% of the time. Leveraging a new CUDA 12.4 feature, CUDA graph conditional nodes, we present an exact GPU-based implementation of greedy decoding for RNN-T models that eliminates this idle time. Our optimizations speed up a 1.1 billion parameter RNN-T model end-to-end by a factor of 2.5x. This technique can be applied to the "label looping" alternative greedy decoding algorithm as well, achieving 1.7x and 1.4x end-to-end speedups when applied to 1.1 billion parameter RNN-T and Token and Duration Transducer models respectively. This work enables a 1.1 billion parameter RNN-T model to run only 16% slower than a similarly sized CTC model, contradicting the common belief that RNN-T models are not suitable for high…
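For readers unfamiliar with the algorithm being accelerated, the following is a minimal CPU-side sketch of standard frame-by-frame greedy RNN-T decoding. It is not the authors' CUDA implementation. The names `greedy_decode`, `joint_argmax`, `update_state`, `toy_joint`, and `toy_update` are hypothetical, and the toy joint function stands in for a real joint network. The sketch only shows the nested, data-dependent loop structure: the number of inner iterations depends on model outputs, which is the kind of control flow that CUDA graph conditional nodes allow to run on the device.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # predictor (decoder) state type, left abstract in this sketch


def greedy_decode(
    frames: Sequence[int],
    joint_argmax: Callable[[int, S], int],  # hypothetical: best token for (frame, state)
    update_state: Callable[[S, int], S],    # hypothetical: predictor step on a new token
    init_state: S,
    blank: int = 0,
    max_symbols: int = 10,                  # assumed cap on tokens emitted per frame
) -> List[int]:
    """Frame-by-frame greedy RNN-T decoding (illustrative sketch only)."""
    hyp: List[int] = []
    state = init_state
    for frame in frames:                    # outer loop: advance over encoder frames
        for _ in range(max_symbols):        # inner loop: emit tokens until blank
            token = joint_argmax(frame, state)
            if token == blank:
                break                       # blank means "move to the next frame"
            hyp.append(token)
            state = update_state(state, token)
    return hyp


# Toy stand-ins (assumptions, not a real model): each frame is the token it
# "wants" to emit, and the state is simply the last emitted token.
def toy_joint(frame: int, state: int) -> int:
    return 0 if frame == 0 or frame == state else frame


def toy_update(state: int, token: int) -> int:
    return token


result = greedy_decode([3, 3, 0, 5], toy_joint, toy_update, init_state=0)
print(result)
```

In a real system each `joint_argmax` and `update_state` call would be one or more GPU kernels, and the `if token == blank` test would be a decision that depends on GPU-computed data.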

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques