The Synergy of Speculative Decoding and Batching in Serving Large Language Models

Qidong Su; Christina Giannoula; Gennady Pekhimenko

arXiv:2310.18813 · cs.LG · October 31, 2023 · 1 cites

Open Access

TL;DR

This paper explores how combining speculative decoding and batching can enhance GPU utilization for large language models, proposing an adaptive strategy that optimizes speculation length based on batch size to improve inference efficiency.

Contribution

The paper introduces an adaptive speculative decoding method that dynamically adjusts speculation length according to batch size, improving inference performance for LLMs.

Findings

01 Optimal speculation length varies with batch size

02 The proposed adaptive method matches or exceeds fixed-length schemes

03 Quantitative model explains the relationship between batch size and speculation length

Abstract

Large Language Models (LLMs) such as GPT are state-of-the-art text generation models that provide significant assistance in daily routines. However, LLM execution is inherently sequential, since these models produce only one token at a time, which leads to low hardware utilization on modern GPUs. Batching and speculative decoding are two techniques for improving GPU hardware utilization in LLM inference. To study their synergy, we build a prototype implementation and perform an extensive characterization analysis on various LLMs and GPU architectures. We observe that the optimal speculation length depends on the batch size used. We analyze this key observation and build a quantitative model to explain it. Based on our analysis, we propose a new adaptive speculative decoding strategy that chooses the optimal speculation length for different batch sizes. Our evaluations show that our proposed…
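The dependence of speculation length on batch size can be illustrated with a toy cost model. The sketch below is not the paper's quantitative model or implementation: the function names, the constants (`t_mem`, `t_token`, `draft_ratio`), and the roofline-style latency formula are all assumptions made for illustration. The only standard ingredient is the expected number of tokens emitted per verification step, (1 - α^(s+1)) / (1 - α), which holds when each of the s draft tokens is accepted independently with probability α.

```python
def expected_tokens(alpha: float, s: int) -> float:
    """Expected tokens emitted per verification step with s draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(s + 1)
    return (1.0 - alpha ** (s + 1)) / (1.0 - alpha)


def step_latency(batch_size: int, s: int, t_mem: float = 1.0,
                 t_token: float = 0.02, draft_ratio: float = 0.1) -> float:
    """Hypothetical latency of one speculate-then-verify step (arbitrary units).

    Assumption: a forward pass costs a fixed memory-bound floor t_mem until
    the compute for all tokens in flight (t_token per token) exceeds it.
    The draft model is assumed to cost draft_ratio times the target model.
    """
    # The target model verifies s draft tokens plus one new token per sequence.
    verify = max(t_mem, t_token * batch_size * (s + 1))
    # The draft model runs s sequential single-token passes.
    draft = s * draft_ratio * max(t_mem, t_token * batch_size)
    return verify + draft


def best_speculation_length(batch_size: int, alpha: float = 0.8,
                            max_s: int = 8) -> int:
    """Pick the speculation length that maximizes tokens per unit time."""
    return max(
        range(max_s + 1),
        key=lambda s: expected_tokens(alpha, s) / step_latency(batch_size, s),
    )


if __name__ == "__main__":
    for b in (1, 16, 64):
        print(b, best_speculation_length(b))
```

Under these toy constants the selected speculation length shrinks as the batch grows: at small batch sizes the verification pass is memory-bound, so extra draft tokens are nearly free, while at large batch sizes every additional speculated token adds compute. In practice the latency function would be replaced by measurements profiled on the target model and GPU.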

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Cosine Annealing · Linear Layer · Residual Connection · Weight Decay · Softmax · Adam