The Synergy of Speculative Decoding and Batching in Serving Large Language Models
Qidong Su, Christina Giannoula, Gennady Pekhimenko

TL;DR
This paper explores how combining speculative decoding and batching can enhance GPU utilization for large language models, proposing an adaptive strategy that optimizes speculation length based on batch size to improve inference efficiency.
Contribution
The paper introduces an adaptive speculative decoding method that dynamically adjusts speculation length according to batch size, improving inference performance for LLMs.
Findings
Optimal speculation length varies with batch size
The proposed adaptive method matches or exceeds fixed-length schemes
Quantitative model explains the relationship between batch size and speculation length
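The link between batch size and speculation length in the findings above can be illustrated with a toy cost model. This is a minimal sketch, not the paper's quantitative model. It assumes that each drafted token is accepted independently with a fixed probability, and that one target-model pass costs a fixed memory-bound floor until the number of tokens verified makes it compute-bound. All names (ToyCostModel, best_speculation_length) and all parameter values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyCostModel:
    # All values below are illustrative assumptions, not measurements.
    accept_rate: float = 0.8           # probability a drafted token is accepted (must be < 1)
    draft_step_ms: float = 1.0         # cost of one draft-model step
    target_floor_ms: float = 20.0      # memory-bound floor of one target-model pass
    target_per_token_ms: float = 0.05  # compute cost per token verified in that pass

    def expected_tokens(self, spec_len: int) -> float:
        """Expected tokens emitted per sequence per verification step,
        assuming independent acceptance: (1 - a^(s+1)) / (1 - a)."""
        a = self.accept_rate
        return (1.0 - a ** (spec_len + 1)) / (1.0 - a)

    def step_latency_ms(self, batch_size: int, spec_len: int) -> float:
        """Draft cost plus a roofline-style target cost: the pass is as slow
        as the larger of the memory-bound floor and the compute term."""
        tokens_verified = batch_size * (spec_len + 1)
        target_ms = max(self.target_floor_ms,
                        self.target_per_token_ms * tokens_verified)
        return spec_len * self.draft_step_ms + target_ms

    def throughput(self, batch_size: int, spec_len: int) -> float:
        """Expected tokens per millisecond across the whole batch."""
        return (batch_size * self.expected_tokens(spec_len)
                / self.step_latency_ms(batch_size, spec_len))


def best_speculation_length(model: ToyCostModel, batch_size: int,
                            max_len: int = 8) -> int:
    """Pick the speculation length (0 = no speculation) that maximizes
    modeled throughput for the given batch size."""
    return max(range(max_len + 1),
               key=lambda s: model.throughput(batch_size, s))


if __name__ == "__main__":
    model = ToyCostModel()
    for batch_size in (1, 16, 64, 256):
        print(batch_size, best_speculation_length(model, batch_size))
```

Under these assumptions, small batches leave the target pass memory-bound, so extra speculated tokens are nearly free to verify and longer speculation wins. Large batches already saturate compute, so each extra speculated token adds verification cost and the preferred length shrinks. An adaptive scheme in this spirit would recompute or look up the length whenever the batch size changes.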
Abstract
Large Language Models (LLMs) such as GPT are state-of-the-art text generation models that provide significant assistance in daily routines. However, LLM execution is inherently sequential, since the model produces only one token at a time, which leads to low hardware utilization on modern GPUs. Batching and speculative decoding are two techniques to improve GPU hardware utilization in LLM inference. To study their synergy, we build a prototype implementation and perform an extensive characterization analysis on various LLMs and GPU architectures. We observe that the optimal speculation length depends on the batch size used. We analyze this key observation and build a quantitative model to explain it. Based on our analysis, we propose a new adaptive speculative decoding strategy that chooses the optimal speculation length for different batch sizes. Our evaluations show that our proposed…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Cosine Annealing · Linear Layer · Residual Connection · Weight Decay · Softmax · Adam
