Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple
Amirhossein Bozorgkhoo, Igor Molybog

TL;DR
This paper introduces a theoretical framework for optimizing the throughput of speculative decoding in language models, enabling prediction of optimal hyperparameters without costly training.
Contribution
It provides an analytical theory linking hyperparameters to inference throughput, simplifying the optimization process for speculative decoding systems.
Findings
The theory accurately predicts throughput-optimal hyperparameters.
It reduces the need for extensive experimental tuning.
It enhances inference efficiency for large language models.
Abstract
Speculative decoding is a technique that uses multiple language models to accelerate inference. Previous works have used an experimental approach to optimize the throughput of the inference pipeline, which involves LLM training and can be costly. This study of speculative decoding proposes a theory that analytically connects the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system. The theory allows the prediction of throughput-optimal hyperparameters for the components of an inference system before their pre-training.
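To make the link between hyperparameters and throughput concrete, here is a minimal sketch of the standard speedup estimate from the broader speculative decoding literature. It is not the theory proposed in this paper, which the abstract does not spell out. The sketch assumes each drafted token is accepted independently with probability `alpha`, that the draft model proposes `gamma` tokens per verification pass, and that `cost_ratio` is the draft model's forward-pass time divided by the target model's. All function and parameter names are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    Assumes i.i.d. token acceptance with rate alpha (an idealization)
    and gamma tokens drafted per pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target model.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-time speedup over plain autoregressive decoding.

    One step costs gamma draft passes plus one target pass, measured in
    units of a single target-model forward pass.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 32) -> int:
    """Draft length in [0, max_gamma] that maximizes the estimated speedup."""
    return max(range(max_gamma + 1), key=lambda g: speedup(alpha, g, cost_ratio))
```

Under these assumptions, `best_gamma(alpha, cost_ratio)` selects a draft length by a cheap search over the closed-form estimate rather than by benchmarking, which is the general kind of analytical shortcut the abstract describes. Setting `gamma = 0` recovers ordinary decoding with a speedup of 1.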
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning and Algorithms · Machine Learning and Data Classification
