Component-Aware Self-Speculative Decoding in Hybrid Language Models

Hector Borobia; Elies Seguí-Mas; Guillermina Tormo-Carbó

arXiv:2605.01106 · cs.CL · May 5, 2026


TL;DR

This paper introduces component-aware self-speculative decoding for hybrid language models: the model's own SSM/linear-attention subgraph serves as a zero-cost internal draft, and the resulting acceptance rate depends strongly on whether the hybrid is parallel or sequential.

Contribution

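The paper presents the first self-speculative decoding method that exploits the internal architectural heterogeneity of hybrid models, and shows that draft acceptance differs sharply by architecture: alpha = 0.68 for parallel hybrids versus alpha = 0.038 for sequential hybrids at draft length 2.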

Findings

01 Parallel hybrids (Falcon-H1) achieve an acceptance rate of alpha = 0.68 at draft length k = 2 under greedy decoding.

02 Sequential hybrids (Qwen3.5) achieve an acceptance rate of only alpha = 0.038 at draft length k = 2.

03 Perplexity degradation predicts speculative viability without running speculative decoding.
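To give a feel for what these acceptance rates mean in practice, the sketch below converts alpha and draft length k into an expected number of tokens emitted per verification pass. It uses the standard geometric-series estimate from the speculative decoding literature, (1 - alpha^(k+1)) / (1 - alpha), and assumes that alpha is a per-token acceptance probability and that acceptances are independent. Both are assumptions made here for illustration; the paper may define or measure alpha differently. The function name `expected_tokens_per_step` is hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    ASSUMPTION (not from the paper): each drafted token is accepted
    independently with probability alpha, and the target always
    contributes one extra token (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the bonus token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for name, alpha in [("parallel", 0.68), ("sequential", 0.038)]:
        print(f"{name}: {expected_tokens_per_step(alpha, 2):.4f}")
    # prints "parallel: 2.1424" then "sequential: 1.0394"
```

Under these assumptions, the parallel-hybrid figure corresponds to roughly two tokens per verification pass, while the sequential-hybrid figure is barely above the one token per pass of ordinary autoregressive decoding.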

Abstract

Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self-speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component-aware self-speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear-attention subgraph as a zero-cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon-H1 (parallel: Mamba-2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of alpha = 0.68 at draft length k=2 under greedy decoding, while sequential hybrids yield only alpha = 0.038…
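The draft-then-verify loop described in the abstract can be sketched as follows for greedy decoding. This is a minimal illustration, not the authors' implementation: `draft_next` and `target_next` are hypothetical callables that map a token prefix to the next greedy token, and the toy models at the bottom are invented purely to make the example runnable. In the paper's setting the draft would be the SSM/linear-attention subgraph of the hybrid model itself, and a real system would score all k drafted positions in a single target forward pass rather than one call per position.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: token prefix -> next token under greedy decoding.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> Tuple[List[int], int]:
    """Run one draft-then-verify step; return (new tokens, accepted drafts)."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Verify phase: keep draft tokens while they match the target's choice.
    ctx = list(prefix)
    out: List[int] = []
    for accepted, tok in enumerate(drafted):
        target_tok = target_next(ctx)
        if target_tok != tok:
            out.append(target_tok)  # first mismatch: emit the target's token
            return out, accepted
        out.append(tok)
        ctx.append(tok)
    out.append(target_next(ctx))  # all k accepted: target adds a bonus token
    return out, k


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             k: int, max_new_tokens: int) -> Tuple[List[int], float]:
    """Greedy speculative generation; returns (tokens, empirical acceptance rate)."""
    tokens = list(prompt)
    drafted_total = accepted_total = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        new, accepted = speculative_step(tokens, draft_next, target_next, k)
        tokens.extend(new)
        drafted_total += k
        accepted_total += accepted
    tokens = tokens[: len(prompt) + max_new_tokens]
    rate = accepted_total / drafted_total if drafted_total else 0.0
    return tokens, rate


# Toy stand-ins (assumptions for illustration only): the target counts
# upward mod 10, and the draft agrees except after tokens 3 and 7.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] % 4 == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rate = generate([0], toy_draft, toy_target, k=2, max_new_tokens=12)
    print(tokens, rate)
```

Because a drafted token is kept only when it equals the target's own greedy choice, the output is identical to plain greedy decoding with the target; the draft affects only how many target passes are needed.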
