SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li

TL;DR
SpecASR is a speculative decoding framework specialized for large language model-based automatic speech recognition. It exploits the fact that ASR decoding is audio-conditioned, which keeps the outputs of small and large ASR models closely aligned, to reduce decoding latency without loss of recognition accuracy.
Contribution
The paper introduces SpecASR, a speculative decoding framework tailored for ASR that exploits the high output alignment between small and large ASR models, together with adaptive draft sequence generation, to improve decoding speed without accuracy loss.
Findings
Achieves 3.04x-3.79x speedup over autoregressive decoding.
Achieves 1.25x-1.84x speedup over existing speculative decoding.
Maintains recognition accuracy despite speed improvements.
Abstract
Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs makes it difficult to meet real-time ASR requirements. Although speculative decoding has been explored for better decoding efficiency, existing approaches usually ignore the key characteristics of the ASR task and achieve limited speedup. To further reduce the real-time ASR latency, in this paper, we propose a novel speculative decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed based on our core observation that ASR decoding is audio-conditioned, which results in high output alignment between small and large ASR models, even given output mismatches in intermediate decoding steps. Therefore, SpecASR features an adaptive draft sequence generation process that…
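
For readers unfamiliar with the baseline that the abstract builds on, here is a minimal sketch of generic greedy speculative decoding (draft, then verify). It is not SpecASR's adaptive method, whose details are not given in this summary. The names `speculative_decode`, `draft_next`, `target_next`, `draft_len`, and `REFERENCE` are hypothetical. The two "models" are toy functions that return the next token for a prefix, and the small model is assumed to disagree with the large one at a single position.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]

def speculative_decode(draft_next: NextToken, target_next: NextToken,
                       prompt: Sequence[Token], max_new_tokens: int,
                       draft_len: int = 4) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop; returns (tokens, verification passes)."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_calls = 0
    while len(out) < limit:
        # 1. The small (draft) model proposes up to draft_len tokens.
        draft: List[Token] = []
        for _ in range(min(draft_len, limit - len(out))):
            draft.append(draft_next(out + draft))
        # 2. The large (target) model checks the draft. A real system scores
        #    all draft positions in one batched forward pass, so we count
        #    this whole step as a single target call.
        target_calls += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if expected == tok:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Whole draft accepted: the same pass yields one bonus token.
            if len(out) + len(accepted) < limit:
                accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out, target_calls

# Toy models (assumption): the target always emits REFERENCE; the draft
# agrees everywhere except position 3.
REFERENCE = [5, 3, 8, 1, 9, 2, 7, 4]

def target_next(prefix: Sequence[Token]) -> Token:
    return REFERENCE[len(prefix)]

def draft_next(prefix: Sequence[Token]) -> Token:
    return 0 if len(prefix) == 3 else REFERENCE[len(prefix)]

tokens, calls = speculative_decode(draft_next, target_next, [], len(REFERENCE))
```

In this toy run the output matches what the target model would produce on its own, while the target is invoked for 2 verification passes instead of 8 single-token steps. The more often the draft model agrees with the target, the fewer passes are needed, which is why the high output alignment described in the abstract matters.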
