Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Zongyue Qin; Zifan He; Neha Prakriya; Jason Cong; Yizhou Sun

arXiv:2409.16560·cs.AI·March 17, 2025

TL;DR

This paper introduces Dynamic-Width Speculative Beam Decoding (DSBD), a novel method that combines speculative decoding with beam sampling to improve the efficiency and quality of large language model inference.

Contribution

The paper proposes a new decoding algorithm that dynamically adjusts beam width and efficiently verifies multiple sequences in parallel, addressing key challenges in integrating speculative decoding with beam sampling.

Findings

1. Achieves faster inference while maintaining or improving output quality.

2. Balances efficiency and accuracy by adjusting the beam width dynamically.

3. Reduces the memory overhead of beam sampling during large language model decoding.

Abstract

Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes inference slow and costly. Speculative decoding has emerged as a promising solution: a smaller auxiliary model drafts future tokens, which the larger model then validates simultaneously, achieving a speed-up of 1-2x. Although speculative decoding samples from the same distribution as multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model's distribution given…
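The draft-then-validate idea in the abstract can be illustrated with the standard single-sequence speculative sampling rule: accept a drafted token with probability min(1, p/q), and on rejection resample from the renormalized residual max(0, p - q). The sketch below is only that baseline rule, not the paper's DSBD verification scheme for beams; the function names `speculative_accept` and `sample_via_speculation`, and the use of plain probability lists in place of real model outputs, are assumptions made for illustration.

```python
import random


def speculative_accept(draft_token, p_target, q_draft, rng):
    """One verification step of standard speculative sampling (illustrative).

    p_target, q_draft: probability lists over the same vocabulary, standing in
    for the large model's and the small draft model's next-token distributions.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]  # > 0 because the token was sampled from q_draft
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q); rng.choices
    # normalizes the weights. The residual is non-zero whenever a rejection
    # can occur, because p and q then differ.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


def sample_via_speculation(p_target, q_draft, rng):
    """Draft one token from q_draft, then verify it against p_target."""
    draft = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
    token, _ = speculative_accept(draft, p_target, q_draft, rng)
    return token
```

Tokens returned by `sample_via_speculation` follow `p_target` even though every draft comes from `q_draft`, which is the sense in which speculative decoding matches multinomial sampling from the larger model. Drawing many samples with a seeded `random.Random` and comparing empirical frequencies against `p_target` is a quick way to check this.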

Taxonomy

Topics: Algorithms and Data Compression