Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting
Tuan Vu Ho, Hiroaki Kokubo, Masaaki Yamamoto, and Yohei Kawaguchi

TL;DR
This paper introduces a model-free speculative decoding method for transformer-based ASR that uses a precomputed token map to accelerate inference on resource-limited devices without sacrificing accuracy.
Contribution
It proposes Token Map Drafting, a novel approach that eliminates the need for a separate draft model in speculative decoding, enabling faster on-device speech recognition.
Findings
Achieves decoding speed-ups of 1.27x and 1.37x on the two evaluation datasets without loss of recognition accuracy.
Outperforms the Distill-spec baseline by 10% in decoding speed on CPU.
Effective for low-perplexity, structured domains.
Abstract
End-to-end automatic speech recognition (ASR) systems based on transformer architectures, such as Whisper, offer high transcription accuracy and robustness. However, their autoregressive decoding is computationally expensive, hence limiting deployment on CPU-based and resource-constrained devices. Speculative decoding (SD) mitigates this issue by using a smaller draft model to propose candidate tokens, which are then verified by the main model. However, this approach is impractical for devices lacking hardware accelerators like GPUs. To address this, we propose Token Map Drafting, a model-free SD technique that eliminates the need for a separate draft model. Instead, we leverage a precomputed n-gram token map derived from domain-specific training data, enabling efficient speculative decoding with minimal overhead. Our method significantly accelerates ASR inference in structured,…
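To make the idea of drafting from a precomputed n-gram token map concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`build_token_map`, `speculative_decode`, `target_next`), the choice of keeping only the most frequent continuation per context, and the greedy exact-match verification rule are all hypothetical. The target model is replaced by a toy function that replays a fixed token sequence. In a real transformer, all draft positions would be scored in one batched forward pass, which the sketch counts as a single verification pass.

```python
from collections import defaultdict


def build_token_map(sequences, n=2, max_draft=3):
    """Map each n-token context to its most frequent continuation.

    Assumption: one continuation of up to max_draft tokens per context.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - n):
            ctx = tuple(seq[i:i + n])
            cont = tuple(seq[i + n:i + n + max_draft])
            counts[ctx][cont] += 1
    return {ctx: max(conts, key=conts.get) for ctx, conts in counts.items()}


def speculative_decode(target_next, token_map, prefix, eos, n=2, max_len=64):
    """Greedy decoding with drafts looked up in token_map.

    target_next(tokens) stands in for the main model's greedy next token.
    Returns the decoded tokens and the number of verification passes.
    """
    out = list(prefix)
    passes = 0
    while len(out) < max_len and (not out or out[-1] != eos):
        draft = token_map.get(tuple(out[-n:]), ())
        passes += 1  # one (simulated) batched pass of the main model
        accepted = []
        # The trailing None never matches, so the main model always
        # contributes one token: a correction or the next token.
        for tok in list(draft) + [None]:
            expected = target_next(out + accepted)
            accepted.append(expected)
            if tok != expected or expected == eos:
                break
        out.extend(accepted)
    return out[:max_len], passes


# Toy data (hypothetical): token 0 plays the role of end-of-sequence.
train = [[1, 2, 3, 4, 5, 0], [1, 2, 3, 4, 6, 0]]
reference = [1, 2, 3, 4, 6, 0]


def target_next(tokens):
    # Stand-in for the ASR decoder: replays the reference transcript.
    return reference[len(tokens)]


token_map = build_token_map(train)
hyp, passes = speculative_decode(target_next, token_map, [1, 2], eos=0)
base, base_passes = speculative_decode(target_next, {}, [1, 2], eos=0)
print(hyp, passes, base_passes)
```

Because every emitted token is one the main model would have chosen greedily, the output matches plain greedy decoding; only the number of passes changes. With an empty map the loop degenerates to ordinary one-token-per-pass decoding, which is the baseline used for comparison in the last lines.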
