Graph-Structured Speculative Decoding

Zhuocheng Gong; Jiahao Liu; Ziyue Wang; Pengfei Wu; Jingang Wang; Xunliang Cai; Dongyan Zhao; Rui Yan

arXiv:2407.16207·cs.CL·July 24, 2024

TL;DR

Graph-structured Speculative Decoding (GSD) speeds up large language model inference by managing multiple draft hypotheses in a directed acyclic graph (DAG), with a reported 1.73× to 1.96× speedup on LLaMA-2 70B that surpasses standard speculative decoding.

Contribution

We introduce GSD, a novel approach that uses a DAG to manage the multiple hypotheses drafted in speculative decoding. Because the hypotheses share many common token sequences, the graph structure reduces computation and increases decoding speed.
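The saving from shared token sequences can be illustrated with a small sketch. It is not the paper's algorithm: it merges hypotheses only by their common prefixes (a prefix tree, which is a restricted kind of DAG), and the names `build_prefix_graph` and `count_nodes` as well as the toy hypotheses are assumptions made for this example. A more general DAG could also merge sequences that recur after different prefixes, which this sketch does not attempt.

```python
from typing import Dict, Sequence


def build_prefix_graph(hypotheses: Sequence[Sequence[str]]) -> Dict:
    """Merge hypotheses into nested dicts so shared prefixes share nodes.

    Hypothetical helper for illustration; each dict key is one token node.
    """
    root: Dict = {}
    for hyp in hypotheses:
        node = root
        for token in hyp:
            # Reuse the existing child if this token was already drafted
            # at this position along the same prefix.
            node = node.setdefault(token, {})
    return root


def count_nodes(node: Dict) -> int:
    """Count token nodes in the merged structure."""
    return sum(1 + count_nodes(child) for child in node.values())


# Toy draft hypotheses (made up for this example).
hypotheses = [
    ["the", "cat", "sat", "down"],
    ["the", "cat", "sat", "up"],
    ["the", "dog", "sat", "down"],
]

total_tokens = sum(len(h) for h in hypotheses)
merged_tokens = count_nodes(build_prefix_graph(hypotheses))
print(total_tokens, merged_tokens)
```

In this toy case the three hypotheses contain 12 tokens in total, but only 8 distinct nodes remain after merging common prefixes, so fewer positions would need to be processed.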

Findings

01. Achieved 1.73× to 1.96× speedup on LLaMA-2 70B.

02. Effectively manages multiple hypotheses with a DAG structure.

03. Surpasses standard speculative decoding in efficiency.

Abstract

Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models (LLMs) by employing a small language model to draft a hypothesis sequence, which is then validated by the LLM. The effectiveness of this approach relies heavily on the balance between the performance and the efficiency of the draft model. In our research, we focus on increasing the proportion of draft tokens that are accepted into the final output by generating multiple hypotheses instead of just one. This gives the LLM more options to choose from, so it can select the longest sequence that meets its standards. Our analysis reveals that hypotheses produced by the draft model share many common token sequences, suggesting a potential for optimizing computation. Leveraging this observation, we introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted…
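The idea of letting the LLM pick the longest acceptable hypothesis can be sketched as follows. This is a simplified illustration, not the paper's verification procedure: it assumes greedy exact-match acceptance (a draft token is accepted only if it equals the target model's own next token), and `target_next`, `accepted_length`, `select_longest_accepted`, and the toy reference sequence are hypothetical stand-ins. Real systems typically verify all draft tokens in a single forward pass of the LLM rather than token by token.

```python
from typing import Callable, List, Optional, Sequence

NextFn = Callable[[List[str]], Optional[str]]


def accepted_length(hyp: Sequence[str], context: Sequence[str],
                    target_next: NextFn) -> int:
    """Number of leading draft tokens the target model agrees with."""
    ctx = list(context)
    n = 0
    for tok in hyp:
        if target_next(ctx) != tok:
            break  # first mismatch ends the accepted prefix
        ctx.append(tok)
        n += 1
    return n


def select_longest_accepted(hyps: Sequence[Sequence[str]],
                            context: Sequence[str],
                            target_next: NextFn) -> List[str]:
    """Return the longest accepted prefix among all draft hypotheses."""
    best: List[str] = []
    for hyp in hyps:
        n = accepted_length(hyp, context, target_next)
        if n > len(best):
            best = list(hyp[:n])
    return best


# Toy "target model": deterministically continues a fixed sentence.
# This is an assumption for the example, standing in for an LLM.
REFERENCE = ["the", "cat", "sat", "on", "the", "mat"]


def target_next(ctx: List[str]) -> Optional[str]:
    return REFERENCE[len(ctx)] if len(ctx) < len(REFERENCE) else None


drafts = [
    ["cat", "sat", "down"],
    ["dog", "sat", "on"],
    ["cat", "sat", "on", "a"],
]
accepted = select_longest_accepted(drafts, ["the"], target_next)
print(accepted)
```

With a single hypothesis such as the first draft, only two tokens would be accepted here; having the third draft available raises that to three, which is the effect the abstract describes.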
