CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
Zhiyuan Ning, Jiawei Shao, Ruge Xu, Xinfei Guo, Jun Zhang, Chi Zhang, Xuelong Li

TL;DR
CAS-Spec introduces a novel adaptive self-speculative decoding method that leverages dynamic strategies and hierarchical routing to significantly accelerate large language model inference without retraining models.
Contribution
The paper proposes CAS-Spec, a new method combining dynamic inference acceleration and adaptive routing to improve on-the-fly speculative decoding efficiency for LLMs.
Findings
Achieves 1.1x to 2.3x speedup over autoregressive decoding.
DyTC algorithm improves speedup by 47% and 48% over baseline methods.
Easily integrates into existing LLMs for enhanced inference acceleration.
Abstract
Speculative decoding has become a widely adopted as an effective technique for lossless inference acceleration when deploying large language models (LLMs). While on-the-fly self-speculative methods offer seamless integration and broad utility, they often fall short of the speed gains achieved by methods relying on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. In this paper, we propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method which constructs speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies, including layer sparsity and activation quantization. Furthermore, traditional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods
