TL;DR
This paper introduces a GPU-accelerated WFST Viterbi lattice decoder for speech recognition that runs up to 240x faster than single-core CPU decoding and uses less memory, enabling real-time streaming and larger decoding graphs on diverse hardware.
Contribution
The paper presents a novel GPU-based Viterbi decoder that optimizes memory use, I/O bandwidth, and parallelism, outperforming existing CPU and GPU decoders in speed and scalability on speech recognition tasks.
Findings
Up to 240x speedup over single-core CPU decoding
Up to 40x faster than the current state-of-the-art GPU decoder, with equivalent results
Supports larger graphs and multiple streams efficiently
Abstract
We present an optimized weighted finite-state transducer (WFST) decoder capable of online streaming and offline batch processing of audio using Graphics Processing Units (GPUs). The decoder is efficient in memory utilization and input/output (I/O) bandwidth, and uses a novel Viterbi implementation designed to maximize parallelism. The reduced memory footprint allows the decoder to process significantly larger graphs than previously possible, while optimizing I/O increases the number of simultaneous streams supported. GPU preprocessing of lattice segments enables intermediate lattice results to be returned to the requestor during streaming inference. Collectively, the proposed algorithm yields up to a 240x speedup over single-core CPU decoding, and up to 40x faster decoding than the current state-of-the-art GPU decoder, while returning equivalent results. This decoder design enables…
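For readers unfamiliar with WFST decoding, the following is a minimal sequential sketch of frame-synchronous Viterbi token passing with beam pruning, the general technique the abstract builds on. It is not the paper's GPU algorithm: it runs on the CPU, keeps only the single best path rather than a lattice, and ignores epsilon arcs. The toy graph, the labels, the costs, and the names `ARCS`, `FINAL`, and `viterbi_decode` are all assumptions made up for illustration.

```python
# Toy WFST (hypothetical): state -> list of
# (input_label, output_label, arc_cost, next_state).
# Costs are negative log-probabilities, so lower is better.
ARCS = {
    0: [(0, "a", 0.5, 1), (1, "b", 0.3, 2)],
    1: [(0, "", 0.1, 1), (1, "c", 0.4, 3)],
    2: [(1, "", 0.2, 2), (0, "d", 0.6, 3)],
    3: [(0, "", 0.1, 3), (1, "", 0.1, 3)],
}
FINAL = {3: 0.0}  # final state -> final cost


def viterbi_decode(acoustic_costs, beam=10.0):
    """Return (cost, output_labels) of the best path, or None if none survives.

    acoustic_costs[t][label] is the cost of emitting `label` at frame t.
    """
    tokens = {0: (0.0, [])}  # active tokens: state -> (path cost, outputs)
    for frame in acoustic_costs:
        new_tokens = {}
        # A GPU decoder would expand these tokens and arcs in parallel;
        # here the expansion is a plain nested loop.
        for state, (cost, outputs) in tokens.items():
            for ilabel, olabel, arc_cost, nxt in ARCS.get(state, []):
                new_cost = cost + arc_cost + frame[ilabel]
                # Viterbi recombination: keep the cheapest token per state.
                if nxt not in new_tokens or new_cost < new_tokens[nxt][0]:
                    new_out = (outputs + [olabel]) if olabel else outputs
                    new_tokens[nxt] = (new_cost, new_out)
        if not new_tokens:
            return None
        # Beam pruning: drop tokens too far behind the frame's best token.
        best = min(c for c, _ in new_tokens.values())
        tokens = {s: t for s, t in new_tokens.items() if t[0] <= best + beam}

    finals = [(c + FINAL[s], out) for s, (c, out) in tokens.items() if s in FINAL]
    return min(finals, key=lambda item: item[0]) if finals else None


# Two frames: the first favours label 0, the second favours label 1.
result = viterbi_decode([[0.1, 2.0], [2.0, 0.1]])
```

A lattice decoder would retain the recombined alternatives instead of discarding them at the "keep the cheapest token per state" step, so that multiple hypotheses can be returned.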
