Graph-Structured Speculative Decoding
Zhuocheng Gong, Jiahao Liu, Ziyue Wang, Pengfei Wu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan

TL;DR
Graph-structured Speculative Decoding (GSD) speeds up large language model inference by managing multiple draft hypotheses in a directed acyclic graph (DAG), which reduces redundant computation. It reports a 1.73× to 1.96× speedup on LLaMA-2 70B and outperforms standard speculative decoding.
Contribution
We introduce GSD, a novel approach that uses a directed acyclic graph (DAG) to manage multiple draft hypotheses in speculative decoding, reducing computation and increasing speed.
Findings
Achieved 1.73× to 1.96× speedup on LLaMA-2 70B.
Effectively manages multiple hypotheses with a DAG structure.
Surpasses standard speculative decoding in efficiency.
Abstract
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models (LLMs) by employing a small language model to draft a hypothesis sequence, which is then validated by the LLM. The effectiveness of this approach heavily relies on the balance between performance and efficiency of the draft model. In our research, we focus on enhancing the proportion of draft tokens that are accepted to the final output by generating multiple hypotheses instead of just one. This allows the LLM more options to choose from and select the longest sequence that meets its standards. Our analysis reveals that hypotheses produced by the draft model share many common token sequences, suggesting a potential for optimizing computation. Leveraging this observation, we introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted…
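The two ideas in the abstract, merging hypotheses that share token sequences and keeping the longest draft the LLM accepts, can be illustrated with a small sketch. This is not the paper's implementation: the abstract is truncated before it describes how the DAG is built, so the example below swaps in a simpler prefix tree (trie), which merges only shared prefixes. The function names `build_token_trie` and `longest_accepted`, the toy token IDs, and the `accept` callback standing in for LLM verification are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def build_token_trie(hypotheses: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Merge hypotheses that share prefixes into a trie (assumed stand-in for the DAG).

    Returns (tokens, parents): node i holds tokens[i] and its parent is
    parents[i], with -1 marking a child of the root.
    """
    tokens: List[int] = []
    parents: List[int] = []
    children: Dict[Tuple[int, int], int] = {}  # (parent index, token) -> node index
    for hyp in hypotheses:
        parent = -1
        for tok in hyp:
            key = (parent, tok)
            if key not in children:
                # First time this token follows this prefix: create a node.
                children[key] = len(tokens)
                tokens.append(tok)
                parents.append(parent)
            parent = children[key]
    return tokens, parents


def longest_accepted(
    hypotheses: List[List[int]],
    accept: Callable[[List[int], int], bool],
) -> List[int]:
    """Return the longest accepted prefix over all hypotheses.

    `accept(prefix, token)` is a hypothetical stand-in for the LLM's
    verification of one draft token given the already accepted prefix.
    """
    best: List[int] = []
    for hyp in hypotheses:
        accepted: List[int] = []
        for tok in hyp:
            if not accept(accepted, tok):
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best


# Three toy draft hypotheses (token IDs are made up).
hypotheses = [[5, 8, 2, 9], [5, 8, 3, 9], [5, 7, 2, 9]]
tokens, parents = build_token_trie(hypotheses)
flat_count = sum(len(h) for h in hypotheses)  # tokens if every draft is verified separately
merged_count = len(tokens)                    # tokens after merging shared prefixes

# Toy verifier: the "LLM" would have produced this sequence.
target = [5, 8, 3, 1]
best = longest_accepted(hypotheses, lambda prefix, tok: tok == target[len(prefix)])
```

In this toy case the three drafts contain 12 tokens in total, while the trie holds 9 nodes. A trie cannot merge the repeated `2, 9` or the trailing `9` that appear under different prefixes; a DAG could in principle share such nodes as well, but the rule GSD uses to decide when that is safe is not stated in this excerpt.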
