Fast Inference via Hierarchical Speculative Decoding
Clara Mohri, Haim Kaplan, Tal Schuster, Yishay Mansour, Amir Globerson

TL;DR
Hierarchical Speculative Decoding (HSD) improves inference speed in transformer models by stacking multiple draft models in a hierarchy, enabling faster token verification and reducing latency.
Contribution
The paper introduces HSD, a hierarchical approach to speculative decoding that optimally combines multiple draft models for faster inference in language models.
Findings
HSD achieves up to 1.2x speed-up over single-draft methods.
The optimal hierarchy can be computed in polynomial time.
HSD effectively reduces generation latency in practice.
Abstract
Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models, ranging from faster but less accurate to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens and the next larger model verifies them in a single forward pass, until finally the target model verifies tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives…
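The propose-and-verify stacking described in the abstract can be illustrated with a small toy. The sketch below is not the paper's algorithm: it assumes greedy decoding with exact-match acceptance (rather than a sampling-based acceptance rule), it runs the verifier position by position where a real system would use one parallel forward pass, and it does not attempt the latency-optimal hierarchy selection. The names `generate`, `small`, `mid`, `target`, and the block size `k` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

# A "model" here is just a function: token prefix -> next token (greedy).
Model = Callable[[List[int]], int]


def generate(models: List[Model], prefix: List[int], n: int, k: int = 4) -> List[int]:
    """Return the n tokens the LAST model in `models` would emit greedily,
    using the earlier models as a stack of drafters (smallest first)."""
    verifier = models[-1]
    out: List[int] = []

    # Base case: a single model decodes one token at a time.
    if len(models) == 1:
        for _ in range(n):
            out.append(verifier(prefix + out))
        return out

    while len(out) < n:
        # The lower part of the hierarchy proposes a block of k tokens.
        draft = generate(models[:-1], prefix + out, k, k)

        # The verifier checks the block. Written as a loop for clarity;
        # in practice this is the step done in a single forward pass.
        accepted: List[int] = []
        for tok in draft:
            expected = verifier(prefix + out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token accepted
            else:
                accepted.append(expected)  # first mismatch: keep the verifier's token
                break
        else:
            # Whole block accepted: the verifier contributes one extra token.
            accepted.append(verifier(prefix + out + accepted))

        out.extend(accepted)

    return out[:n]


# Toy models (assumptions for the demo): the drafters agree with the
# target on most positions and disagree on a fixed pattern of others.
def target(ctx: List[int]) -> int:
    return (7 * len(ctx) + sum(ctx)) % 11


def mid(ctx: List[int]) -> int:
    t = target(ctx)
    return t if len(ctx) % 5 else (t + 1) % 11


def small(ctx: List[int]) -> int:
    t = target(ctx)
    return t if len(ctx) % 2 else (t + 1) % 11


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(generate([small, mid, target], prompt, 20))
    print(generate([target], prompt, 20))
```

Under these assumptions, every token kept at a level equals what that level's verifier would have produced itself, so the hierarchical output is identical to decoding with the target alone. This mirrors the abstract's point that speculative decoding reduces latency without sacrificing output quality; the speed-up in a real system comes from how many drafted tokens each verification pass accepts.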
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis
