Self-Speculative Decoding for LLM-based ASR with CTC Encoder Drafts

George Saon; Samuel Thomas; Takashi Fukuda; Tohru Nagano; Avihu Dekel; Luis Lastras

arXiv:2603.11243 · eess.AS · March 13, 2026

TL;DR

This paper introduces self-speculative decoding for LLM-based speech recognition, using the model's own CTC encoder as the draft model. Across nine corpora and five languages, the approach can accelerate inference and reduce word error rate (WER) at the same time.

Contribution

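The paper presents a three-step decoding procedure that combines a CTC encoder with an LLM: accept the greedy CTC hypothesis when the frame entropies are low, otherwise verify it in a single LLM forward pass, and fall back to auto-regressive decoding from the accepted CTC prefix if verification fails. The method is evaluated on nine corpora in five languages.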

Findings

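01 Achieves a record 5.58% WER on the HuggingFace Open ASR benchmark with a 1B-parameter LLM and a 440M-parameter CTC encoder

02 Improves the inverse real-time factor by a factor of 4.4 with minimal WER increase

03 Accelerates decoding and reduces WER simultaneously across nine corpora and five languages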

Abstract

We propose self-speculative decoding for speech-aware LLMs by using the CTC encoder as a draft model to accelerate auto-regressive (AR) inference and improve ASR accuracy. Our three-step procedure works as follows: (1) if the frame entropies of the CTC output distributions are below a threshold, the greedy CTC hypothesis is accepted as final; (2) otherwise, the CTC hypothesis is verified in a single LLM forward pass using a relaxed acceptance criterion based on token likelihoods; (3) if verification fails, AR decoding resumes from the accepted CTC prefix. Experiments on nine corpora and five languages show that this approach can simultaneously accelerate decoding and reduce WER. On the HuggingFace Open ASR benchmark with a 1B parameter LLM and 440M parameter CTC encoder, we achieve a record 5.58% WER and improve the inverse real time factor by a factor of 4.4 with only a 12% relative…
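The three-step procedure in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `frame_entropy`, `greedy_ctc`, `self_speculative_decode`, `score_draft`, and `continue_ar` are hypothetical, as are the threshold values, the choice of blank index 0, the use of the maximum frame entropy for step 1, and the per-token probability floor used as the "relaxed acceptance criterion" in step 2. The LLM is replaced by two plain callables so the control flow can be run on its own.

```python
import math
from typing import Callable, List, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank symbol


def frame_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one frame's CTC output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def greedy_ctc(frames: Sequence[Sequence[float]]) -> List[int]:
    """Greedy CTC hypothesis: per-frame argmax, merge repeats, drop blanks."""
    out: List[int] = []
    prev = None
    for probs in frames:
        tok = max(range(len(probs)), key=probs.__getitem__)
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out


def self_speculative_decode(
    frames: Sequence[Sequence[float]],
    score_draft: Callable[[List[int]], List[float]],  # hypothetical: one LLM pass
    continue_ar: Callable[[List[int]], List[int]],    # hypothetical: AR decoding
    entropy_threshold: float = 0.5,                   # assumed value
    min_token_prob: float = 0.1,                      # assumed value
) -> Tuple[List[int], str]:
    draft = greedy_ctc(frames)

    # Step 1: if every frame entropy is below the threshold, accept the
    # greedy CTC hypothesis as final (max over frames is an assumption).
    if max((frame_entropy(f) for f in frames), default=0.0) < entropy_threshold:
        return draft, "ctc"

    # Step 2: verify the draft in a single pass. score_draft returns the
    # likelihood the LLM assigns to each draft token given its prefix.
    n_ok = 0
    for p in score_draft(draft):
        if p < min_token_prob:  # relaxed criterion: a floor, not an exact match
            break
        n_ok += 1
    if n_ok == len(draft):
        return draft, "verified"

    # Step 3: verification failed, so AR decoding resumes from the accepted
    # CTC prefix.
    return continue_ar(draft[:n_ok]), "ar"


# Toy inputs over a 3-symbol vocabulary (index 0 is the blank).
confident = [[0.01, 0.98, 0.01], [0.98, 0.01, 0.01], [0.01, 0.01, 0.98]]
uncertain = [[0.4, 0.5, 0.1], [0.98, 0.01, 0.01], [0.3, 0.3, 0.4]]

print(self_speculative_decode(confident, lambda d: [], lambda p: p))
print(self_speculative_decode(uncertain, lambda d: [0.9, 0.8], lambda p: p))
print(self_speculative_decode(uncertain, lambda d: [0.9, 0.02], lambda p: p + [1]))
```

In this sketch the first call exits at step 1 without touching the LLM stubs, the second is accepted after a single scoring pass, and the third falls back to the AR stub with the one-token accepted prefix. The speedup in a real system would come from how often utterances exit at steps 1 and 2 rather than step 3.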

Code & Models

Models

ibm-granite/granite-4.0-1b-speech (Hugging Face model · 59k downloads · 211 likes)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling