CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs

Zhiyuan Ning; Jiawei Shao; Ruge Xu; Xinfei Guo; Jun Zhang; Chi Zhang; Xuelong Li

arXiv:2510.26843·cs.LG·November 3, 2025

Open Access

TL;DR

CAS-Spec is a cascade adaptive self-speculative decoding method that builds its draft models from dynamically switchable inference acceleration (DSIA) strategies and routes among them adaptively, accelerating large language model inference losslessly without retraining models.

Contribution

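The paper proposes CAS-Spec, which constructs speculative draft models from dynamically switchable inference acceleration (DSIA) strategies such as layer sparsity and activation quantization, and combines them with adaptive routing to improve on-the-fly speculative decoding efficiency for LLMs.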

Findings

01. Achieves an average speedup of 1.1x to 2.3x over autoregressive decoding.

02. The DyTC algorithm improves average speedup by 47% and 48% over cascade-based and tree-based baseline algorithms, respectively.

03. Integrates easily into existing LLMs for further inference acceleration.

Abstract

Speculative decoding has become widely adopted as an effective technique for lossless inference acceleration when deploying large language models (LLMs). While on-the-fly self-speculative methods offer seamless integration and broad utility, they often fall short of the speed gains achieved by methods relying on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. In this paper, we propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method which constructs speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies, including layer sparsity and activation quantization. Furthermore, traditional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative…
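
To make the "lossless" draft-then-verify idea in the abstract concrete, here is a minimal Python sketch of generic greedy speculative decoding. It is not the CAS-Spec or DyTC algorithm: `target_next` and `draft_next` are hypothetical toy functions standing in for the full model and a cheaper draft (for example, the same model with some layers skipped), and the verification loop calls the target once per position where a real system would use a single batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the full model: a deterministic next-token rule (hypothetical)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for a cheap draft: disagrees with the target on some contexts."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50


def autoregressive(next_tok: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding, one token per model call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_tok(out))
    return out


def speculative_greedy(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], float]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    out = list(prompt)
    goal = len(prompt) + n_new
    accepted = proposed = 0
    while len(out) < goal:
        # 1) Draft phase: the cheap model proposes k tokens autoregressively.
        spec: List[int] = []
        for _ in range(k):
            spec.append(draft(out + spec))
        proposed += k
        # 2) Verify phase: accept drafted tokens while they match the target's choice.
        ctx = list(out)
        for tok in spec:
            expected = target(ctx)
            if expected == tok:
                ctx.append(tok)
                accepted += 1
            else:
                ctx.append(expected)  # first mismatch: use the target's token and stop
                break
        else:
            ctx.append(target(ctx))  # all k accepted: the target adds one bonus token
        out = ctx
    return out[:goal], accepted / proposed


if __name__ == "__main__":
    tokens, rate = speculative_greedy(target_next, draft_next, [1, 2, 3], 20)
    print(tokens == autoregressive(target_next, [1, 2, 3], 20), round(rate, 2))
```

Because every token that enters the output is one the target itself would have chosen for that prefix, the result matches plain greedy decoding exactly; the draft only affects how many target calls are needed. The acceptance rate returned here is the kind of signal an adaptive scheme could use to pick draft models and draft lengths, though how CAS-Spec does so is not specified in the text above.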

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods