Component-Aware Self-Speculative Decoding in Hybrid Language Models
Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

TL;DR
This paper introduces component-aware self-speculative decoding for hybrid language models, exploiting architectural heterogeneity to improve decoding efficiency and acceptance rates.
Contribution
It is the first method to leverage internal architectural heterogeneity in hybrid models for self-speculative decoding, and it shows that draft acceptance rates differ sharply between parallel and sequential hybrid architectures.
Findings
Parallel hybrids achieve acceptance rate alpha=0.68 at draft length 2.
Sequential hybrids achieve acceptance rate alpha=0.038 at draft length 2.
Perplexity degradation predicts speculative viability without decoding.
Abstract
Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self-speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component-aware self-speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear-attention subgraph as a zero-cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon-H1 (parallel: Mamba-2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of alpha = 0.68 at draft length k=2 under greedy decoding, while sequential hybrids yield only alpha = 0.038…
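The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch of greedy speculative decoding. This is not the authors' implementation: the `NextToken` interface, the toy models, and the definition of the acceptance rate as accepted drafted tokens divided by all drafted tokens are assumptions made for illustration, and the paper's exact definition of alpha may differ. In a real system the target model would score all drafted positions in one batched forward pass; the sketch loops over them for clarity.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a model is a function from a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, verify them against the target, return (emitted tokens, accepted count)."""
    # Draft k candidate tokens autoregressively with the cheap model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # Verify: accept drafted tokens while they match the target's greedy choice.
    emitted: List[int] = []
    accepted = 0
    for i in range(k):
        target_token = target_next(prefix + drafted[:i])
        emitted.append(target_token)
        if target_token != drafted[i]:
            # First mismatch: keep the target's token and discard the rest of the draft.
            return emitted, accepted
        accepted += 1

    # All k drafted tokens accepted: the target supplies one extra token.
    emitted.append(target_next(prefix + drafted))
    return emitted, accepted


def speculative_decode(prompt: List[int], draft_next: NextToken, target_next: NextToken,
                       k: int, max_new: int) -> Tuple[List[int], float]:
    """Generate max_new tokens; return the sequence and the empirical acceptance rate."""
    seq = list(prompt)
    drafted_total = 0
    accepted_total = 0
    while len(seq) - len(prompt) < max_new:
        emitted, accepted = speculative_step(seq, draft_next, target_next, k)
        seq.extend(emitted)
        drafted_total += k
        accepted_total += accepted
    return seq[:len(prompt) + max_new], accepted_total / drafted_total


def greedy_decode(prompt: List[int], next_token: NextToken, max_new: int) -> List[int]:
    """Plain greedy decoding with a single model, used as the reference output."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(next_token(seq))
    return seq


# Toy models (assumptions): the target counts upward modulo 10,
# and the draft agrees with it except after the token 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    tokens, alpha = speculative_decode([0], toy_draft, toy_target, k=2, max_new=20)
    print(tokens, alpha)
```

Under greedy decoding the emitted sequence is identical to what the target alone would produce, so the acceptance rate affects only speed, not output. A low acceptance rate means most drafted tokens are discarded and the drafting work is largely wasted, which is why the gap between the parallel and sequential hybrids matters.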
