NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding
Vladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg

TL;DR
NGPU-LM is a GPU-optimized n-gram language model for context-biasing in greedy ASR decoding. It closes more than 50% of the accuracy gap between greedy and beam search in out-of-domain scenarios while adding less than 7% computational overhead.
Contribution
This work introduces NGPU-LM, a novel GPU-accelerated data structure for n-gram models enabling fast, parallel inference across various ASR architectures with minimal overhead.
Findings
Eliminates more than 50% of the accuracy gap between greedy and beam search in out-of-domain scenarios.
Adds less than 7% computational overhead to customizable greedy decoding.
Provides an open-source implementation for industrial and research use.
Abstract
Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
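The idea of biasing greedy decoding with an n-gram language model can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the NGPU-LM implementation: the function names (`build_dense_bigram`, `greedy_decode_with_lm`), the fusion weight `alpha`, the blank handling, and the toy bigram LM are all hypothetical. It assumes a backoff bigram LM expanded into a dense table indexed by `[previous token][next token]`, so that scoring the whole vocabulary for one hypothesis is a single row lookup, which is the kind of operation that maps naturally to batched GPU indexing.

```python
import math

BLANK = 0  # assumption: token id 0 is a blank that carries no LM score


def build_dense_bigram(vocab_size, unigram_logp, bigram_logp, backoff):
    """Expand a backoff bigram LM into a dense [state][token] score table.

    The state is the previously emitted token id; BLANK doubles as the
    start-of-utterance state in this toy setup.
    """
    table = []
    for prev in range(vocab_size):
        row = []
        for tok in range(vocab_size):
            if tok == BLANK:
                row.append(0.0)  # blank is never scored by the LM
            elif (prev, tok) in bigram_logp:
                row.append(bigram_logp[(prev, tok)])
            else:
                # Back off to the unigram score when the bigram is unseen.
                row.append(backoff.get(prev, 0.0) + unigram_logp[tok])
        table.append(row)
    return table


def greedy_decode_with_lm(asr_logp, lm_table, alpha):
    """Greedy decoding with LM fusion.

    asr_logp: [batch][time][vocab] log-probabilities from the ASR model.
    Returns one list of emitted token ids per utterance.
    """
    batch = len(asr_logp)
    states = [BLANK] * batch            # one LM state per hypothesis
    outputs = [[] for _ in range(batch)]
    steps = max(len(utt) for utt in asr_logp)
    for t in range(steps):
        for b in range(batch):          # on a GPU this loop would be one batched op
            if t >= len(asr_logp[b]):
                continue
            lm_row = lm_table[states[b]]
            fused = [a + alpha * l for a, l in zip(asr_logp[b][t], lm_row)]
            best = max(range(len(fused)), key=fused.__getitem__)
            if best != BLANK:
                outputs[b].append(best)
                states[b] = best        # advance the LM state
    return outputs


# Toy example: 3 tokens (blank, 1, 2); the LM strongly prefers "1 2" over "1 1".
unigram = {1: math.log(0.5), 2: math.log(0.5)}
bigram = {(1, 2): math.log(0.9), (1, 1): math.log(0.1)}
lm_table = build_dense_bigram(3, unigram, bigram, backoff={})

frames = [[
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(0.1), math.log(0.5), math.log(0.4)],
]]

plain = greedy_decode_with_lm(frames, lm_table, alpha=0.0)   # [[1, 1]]
biased = greedy_decode_with_lm(frames, lm_table, alpha=1.0)  # [[1, 2]]
```

In this toy case the acoustic scores alone pick token 1 twice, while the fused score flips the second step to token 2 because the LM assigns the bigram (1, 2) a much higher probability. Decoding stays greedy throughout: each hypothesis keeps a single LM state, so no beam has to be maintained.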
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need
