SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding

Linye Wei; Shuzhang Zhong; Songqiang Xu; Runsheng Wang; Ru Huang; Meng Li

arXiv:2507.18181·eess.AS·July 29, 2025

TL;DR

SpecASR is a speculative decoding framework specialized for large language model-based automatic speech recognition. It exploits the fact that ASR decoding is audio-conditioned, which keeps the outputs of small and large ASR models highly aligned, to reduce decoding latency while maintaining recognition accuracy.

Contribution

The paper introduces SpecASR, a speculative decoding framework tailored to ASR. It exploits the high output alignment between small and large ASR models and uses an adaptive draft sequence generation process to speed up decoding without accuracy loss.

Findings

01 Achieves 3.04x-3.79x speedup over autoregressive decoding.

02 Achieves 1.25x-1.84x speedup over existing speculative decoding.

03 Maintains recognition accuracy despite speed improvements.

Abstract

Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the real-time ASR requirements. Although speculative decoding has been explored for better decoding efficiency, existing methods usually ignore the key characteristics of the ASR task and achieve limited speedup. To further reduce the real-time ASR latency, in this paper, we propose a novel speculative decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed based on our core observation that ASR decoding is audio-conditioned, which results in high output alignment between small and large ASR models, even given output mismatches in intermediate decoding steps. Therefore, SpecASR features an adaptive draft sequence generation process that…
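The draft-then-verify idea that speculative decoding builds on can be illustrated with a minimal sketch. This is generic greedy speculative decoding, not SpecASR's adaptive draft generation, whose details are cut off in the abstract above. All names here (`speculative_decode`, `draft_next`, `target_next`, `REFERENCE`) are hypothetical, and the two "models" are toy functions: the target always emits the reference transcript, and the draft makes one mistake and then realigns, loosely mimicking the audio-conditioned alignment the abstract describes.

```python
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"
# Toy transcript standing in for the large (target) model's greedy output.
REFERENCE: List[str] = "the quick brown fox jumps over the lazy dog".split() + [EOS]

NextToken = Callable[[Sequence[str]], str]


def target_next(prefix: Sequence[str]) -> str:
    """Hypothetical large model: always emits the reference token."""
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else EOS


def draft_next(prefix: Sequence[str]) -> str:
    """Hypothetical small model: wrong at position 3, aligned elsewhere."""
    return "box" if len(prefix) == 3 else target_next(prefix)


def speculative_decode(
    draft: NextToken, target: NextToken, max_len: int, draft_len: int = 4
) -> Tuple[List[str], int]:
    """Greedy draft-then-verify decoding.

    Returns the decoded tokens and the number of target verification
    passes. A real system verifies a whole draft in one batched forward
    pass; here that pass is simulated token by token.
    """
    out: List[str] = []
    target_passes = 0
    while len(out) < max_len and EOS not in out:
        # 1) The draft model proposes up to draft_len tokens.
        proposal: List[str] = []
        for _ in range(draft_len):
            proposal.append(draft(out + proposal))
            if proposal[-1] == EOS:
                break
        # 2) The target model verifies the proposal (counted as one pass).
        target_passes += 1
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always keep the target's own token
            if tok != expected or expected == EOS:
                break  # mismatch was corrected, or decoding finished
        else:
            # Every drafted token matched: the same pass yields one more token.
            out.append(target(out))
    if EOS in out:
        out = out[: out.index(EOS) + 1]
    return out[:max_len], target_passes


tokens, passes = speculative_decode(draft_next, target_next, max_len=32)
print(tokens, passes)
```

In this toy run the output is identical to the target model's greedy transcript, while the target is invoked far fewer times than once per token. Any real speedup also depends on the draft model's cost and acceptance rate, which this sketch does not model.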
