Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui

TL;DR
Clover introduces a novel speculative decoding method that leverages sequential knowledge to significantly improve efficiency in large language model generation, outperforming existing methods like Medusa.
Contribution
The paper proposes Clover, a new speculative decoding algorithm that integrates sequential knowledge via a Regressive Connection and Attention Decoder to enhance hit rates and decoding efficiency.
Findings
Clover achieves up to a 91% efficiency improvement over the baseline on Baichuan-Small.
Clover outperforms Medusa by up to 57% on Baichuan-Large.
By verifying several candidate tokens in a single decoding step, the method amortizes the memory-transfer cost that dominates GPU-based auto-regressive decoding.
Abstract
Large language models (LLMs) suffer from low efficiency due to the mismatch between the requirements of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded into the GPU cache through its limited memory bandwidth for computation, but only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer instead of computation. Recently, parallel decoding, a type of speculative decoding algorithm, has become more popular and has demonstrated impressive efficiency improvements in generation. It adds extra decoding heads to large models, enabling them to predict multiple subsequent tokens simultaneously and verify these candidate continuations in a single decoding step. However, this approach deviates from the training objective of next token prediction used during…
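The verify step described in the abstract can be illustrated with a minimal sketch. It assumes greedy acceptance: a drafted token is kept only if it equals the token the base model would have produced at that position. The names `verify_greedy` and `next_token` and the toy counting model are hypothetical and are not taken from the paper. The sketch does not model Clover's Regressive Connection or Attention Decoder, only the generic accept-or-correct logic shared by speculative decoding methods. A real system scores all drafted positions in one forward pass; the loop below calls the model once per position purely for readability.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens emitted by one speculative step under greedy acceptance.

    Drafted tokens are accepted while they match the base model's own choice.
    At the first mismatch the base model's token replaces the draft and the
    step ends. If every drafted token matches, one extra base-model token is
    appended, so each step emits at least one token.
    """
    context = list(prefix)
    emitted: List[int] = []
    for drafted in draft:
        target = next_token(context)  # what the base model would emit here
        if drafted != target:
            emitted.append(target)  # correction from the base model
            return emitted
        emitted.append(drafted)
        context.append(drafted)
    emitted.append(next_token(context))  # bonus token after a full match
    return emitted


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a base model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # Two of three drafted tokens match; the third is corrected.
    print(verify_greedy([1, 2, 3], [4, 5, 7], toy_next_token))
    # All drafted tokens match, so one bonus token is appended.
    print(verify_greedy([1, 2, 3], [4, 5, 6], toy_next_token))
```

Under this scheme the number of tokens emitted per step depends on how many drafted tokens match, which is why a higher hit rate for the extra heads translates into fewer decoding steps overall.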
Taxonomy
Topics: Error Correcting Code Techniques · Algorithms and Data Compression · Neural Networks and Applications
Methods: ALIGN
