CLaSp: In-Context Layer Skip for Self-Speculative Decoding

Longze Chen; Renke Shan; Huiming Wang; Lu Wang; Ziqiang Liu; Run Luo; Jiawei Wang; Hamid Alinejad-Rokny; Min Yang

arXiv:2505.24196 · cs.CL · June 2, 2025

TL;DR

CLaSp introduces a plug-and-play in-context layer-skipping method for self-speculative decoding that accelerates large language model decoding without additional training or modules.

Contribution

The paper proposes a dynamic-programming-based layer-skipping strategy for self-speculative decoding that requires no extra training and no additional modules.

Findings

01. Achieves a 1.3x to 1.7x speedup on LLaMA3 models

02. Maintains the original text distribution during acceleration

03. Works across diverse downstream tasks
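
The distribution-preserving finding follows from the draft-then-verify structure of speculative decoding in general. The toy sketch below illustrates the greedy case only: a cheap draft proposes a few tokens, the full model checks them, and the output matches plain greedy decoding no matter how poor the draft is. The functions `full_next` and `draft_next` are hypothetical stand-ins rather than anything from the paper, and a real system would verify all drafted tokens in one batched forward pass instead of one call per token.

```python
def full_next(prefix):
    # Hypothetical stand-in for the verify (full) model: a deterministic next-token rule.
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_next(prefix):
    # Hypothetical cheaper draft: agrees with the full model except at some positions.
    tok = full_next(prefix)
    return tok if len(prefix) % 3 else (tok + 1) % 50

def greedy_decode(prompt, n_new):
    # Baseline: one full-model call per generated token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(full_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=4):
    seq = list(prompt)
    target_len = len(prompt) + n_new
    while len(seq) < target_len:
        # Draft stage: propose k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Verify stage: keep the longest prefix that the full model agrees with.
        accepted = []
        for tok in draft:
            expected = full_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch with the full model's token
                break
        else:
            # Every drafted token was accepted, so the full model adds one more.
            accepted.append(full_next(seq + accepted))
        seq.extend(accepted)
    return seq[:target_len]

if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(speculative_decode(prompt, 20) == greedy_decode(prompt, 20))
```

Every token appended in the verify stage is the full model's own greedy choice for its prefix, which is why the two decoders agree. Under sampling, the same guarantee is usually obtained with a rejection-sampling acceptance rule, which this sketch does not cover.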

Abstract

Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and ensure compatibility across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism by skipping intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by leveraging the complete hidden states from the last verification stage…
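
Because the abstract is truncated here, the exact objective and recurrence of the dynamic program are not given. The sketch below is one plausible reading, stated as an assumption rather than as the authors' algorithm: given the full model's per-layer hidden states for the last verified token, choose exactly `num_skip` layers to skip so that the draft's hidden state stays close, by cosine similarity, to the full model's state at each depth. The toy residual layers, the cosine criterion, and all function names are illustrative choices.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def make_toy_layers(num_layers, dim, seed=0):
    # Hypothetical residual layers standing in for transformer blocks.
    rng = np.random.default_rng(seed)
    weights = [rng.normal(scale=0.3, size=(dim, dim)) for _ in range(num_layers)]
    return [lambda h, W=W: h + np.tanh(W @ h) for W in weights]

def full_forward(layers, h0):
    # Hidden states of the full model: states[i] is the state after i layers.
    states = [h0]
    for layer in layers:
        states.append(layer(states[-1]))
    return states

def select_skipped_layers(layers, full_states, num_skip):
    """Assumed DP: best[i][j] holds (hidden state, skipped layer indices)
    after the first i layers with exactly j of them skipped."""
    L = len(layers)
    best = [[None] * (num_skip + 1) for _ in range(L + 1)]
    best[0][0] = (full_states[0], [])
    for i in range(1, L + 1):
        target = full_states[i]  # the full model's state at this depth
        for j in range(min(i, num_skip) + 1):
            candidates = []
            if best[i - 1][j] is not None:
                # Option 1: run layer i on the best state that has j skips so far.
                h, skipped = best[i - 1][j]
                candidates.append((layers[i - 1](h), skipped))
            if j > 0 and best[i - 1][j - 1] is not None:
                # Option 2: skip layer i and carry the state forward unchanged.
                h, skipped = best[i - 1][j - 1]
                candidates.append((h, skipped + [i - 1]))
            if candidates:
                best[i][j] = max(candidates, key=lambda c: cosine(c[0], target))
    return best[L][num_skip]

if __name__ == "__main__":
    layers = make_toy_layers(num_layers=8, dim=16)
    h0 = np.random.default_rng(1).normal(size=16)
    states = full_forward(layers, h0)
    draft_state, skipped = select_skipped_layers(layers, states, num_skip=3)
    print("skipped layer indices:", skipped)
    print("cosine to full output:", cosine(draft_state, states[-1]))
```

The table has (L + 1) × (num_skip + 1) cells and each cell costs at most one layer evaluation, so the search is far cheaper than enumerating every subset of layers. The per-cell choice is greedy, so this sketch is a heuristic and does not guarantee the globally best skip set.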

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · Neural Networks and Applications