Principled Coarse-Grained Acceptance for Speculative Decoding in Speech

Moran Yanuka; Paul Dixon; Eyal Finkelshtein; Daniel Rotman; Raja Giryes

arXiv:2511.13732 · eess.AS · January 23, 2026


TL;DR

This paper introduces Principled Coarse-Graining (PCG), a speculative decoding method for speech generation that verifies draft proposals at the level of Acoustic Similarity Groups rather than exact tokens, raising acceptance rates and speed while maintaining speech quality.

Contribution

The paper proposes PCG, a new group-level acceptance method for speculative decoding in speech LLMs, enhancing speed while preserving speech quality and intelligibility.

Findings

01 Increased acceptance and throughput on LibriTTS

02 Maintained speech intelligibility and speaker similarity

03 Outperformed standard speculative decoding methods

Abstract

Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model's embedding space. By splitting each token's probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On…
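The group-level acceptance step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `coarse_grain` and `verify_draft_token` are hypothetical, the equal split of a token's mass across its groups is an assumption (the abstract does not say how the mass is split), and the sketch assumes every token belongs to at least one group. It applies standard speculative rejection sampling to the group variable instead of the token.

```python
import random


def coarse_grain(probs, groups):
    """Map a token distribution to a distribution over (overlapping) groups.

    Assumption: a token that sits in k groups gives 1/k of its mass to each.
    """
    counts = [0] * len(probs)
    for members in groups:
        for tok in members:
            counts[tok] += 1
    return [sum(probs[tok] / counts[tok] for tok in members) for members in groups]


def verify_draft_token(draft_tok, p_target, q_draft, groups, rng):
    """Accept or reject a draft token at the group level.

    Returns (accepted, group_index). On rejection, the group index is
    resampled from the normalized residual max(0, P_G - Q_G).
    """
    p_g = coarse_grain(p_target, groups)
    q_g = coarse_grain(q_draft, groups)

    # Choose uniformly among the groups containing the draft token. Under
    # the equal-split assumption this makes the group a sample from q_g.
    containing = [g for g, members in enumerate(groups) if draft_tok in members]
    g = rng.choice(containing)

    # Standard speculative acceptance test, applied to the group variable.
    if rng.random() < min(1.0, p_g[g] / q_g[g]):
        return True, g

    # Rejected: draw a replacement group from the residual distribution.
    residual = [max(0.0, p - q) for p, q in zip(p_g, q_g)]
    g_new = rng.choices(range(len(groups)), weights=residual)[0]
    return False, g_new


# Toy example: 4 tokens, token 1 is shared by the first two groups.
groups = [[0, 1], [1, 2], [3]]
p_target = [0.1, 0.4, 0.3, 0.2]
q_draft = [0.4, 0.2, 0.2, 0.2]
rng = random.Random(0)
draft_tok = rng.choices(range(4), weights=q_draft)[0]
accepted, group = verify_draft_token(draft_tok, p_target, q_draft, groups, rng)
```

When the draft and target distributions agree, the acceptance ratio is 1 for every group, so every draft token is accepted. How the paper chooses the emitted token after a rejection is not stated in the abstract, so the sketch stops at the group index.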


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis