Adaptive Draft-Verification for Efficient Large Language Model Decoding

Xukun Liu; Bowen Lei; Ruqi Zhang; Dongkuan Xu

arXiv:2407.12021 · cs.CL · August 20, 2024

TL;DR

This paper introduces ADED, an adaptive draft-verification method that accelerates large language model decoding without fine-tuning, by dynamically approximating output distributions and balancing exploration and exploitation.

Contribution

The paper presents a novel adaptive draft-verification approach for LLM decoding that improves efficiency without requiring model fine-tuning or fixed retrieval schemes.

Findings

01. ADED significantly speeds up decoding across various benchmarks.

02. The method maintains high accuracy comparable to standard decoding.

03. It adapts to changing token probabilities during generation.

Abstract

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires a separate forward pass through the model for each token generated, which is computationally inefficient and poses challenges for deploying LLMs in latency-sensitive scenarios. The main limitations of current decoding methods stem from their inefficiencies and resource demands. Existing approaches either necessitate fine-tuning smaller models, which is resource-intensive, or rely on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across different models and contexts. To address these issues, we introduce a novel methodology called ADED, which accelerates LLM decoding without requiring…
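
To make the cost argument concrete, the following is a minimal Python sketch of generic greedy draft-then-verify decoding next to plain autoregressive decoding. It is not ADED's algorithm, whose details lie beyond the truncated abstract above. The names `toy_target`, `toy_draft`, `autoregressive_decode`, and `draft_verify_decode` are hypothetical, the "models" are toy deterministic functions, and the single batched verification pass of a real LLM is only simulated by a counter.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def toy_target(context: Sequence[Token]) -> Token:
    """Hypothetical stand-in for an LLM's greedy next-token choice."""
    return (sum(context) * 31 + len(context)) % 50


def toy_draft(context: Sequence[Token]) -> Token:
    """Hypothetical cheap drafter: agrees with the target except when
    the context length is a multiple of 3 (an arbitrary toy error rule)."""
    t = toy_target(context)
    return t if len(context) % 3 else (t + 1) % 50


def autoregressive_decode(
    target: NextTokenFn, prompt: Sequence[Token], n_new: int
) -> Tuple[List[Token], int]:
    """Baseline: one target forward pass per generated token."""
    tokens = list(prompt)
    passes = 0
    for _ in range(n_new):
        tokens.append(target(tokens))
        passes += 1
    return tokens, passes


def draft_verify_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: Sequence[Token],
    n_new: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify: propose k tokens, keep the verified prefix."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(tokens) < goal:
        # Drafting step: propose k tokens with the cheap drafter.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # Verification step: a real system scores all k positions in one
        # batched target pass; here that pass is simulated by calling the
        # toy target per position and counting the whole step once.
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: corrected token kept, rest dropped
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], passes


if __name__ == "__main__":
    baseline, base_passes = autoregressive_decode(toy_target, [1, 2, 3], 12)
    fast, fast_passes = draft_verify_decode(toy_target, toy_draft, [1, 2, 3], 12)
    print(baseline == fast, base_passes, fast_passes)
```

Because every kept token is the target's own greedy choice, the output matches the baseline exactly. Any saving comes only from how many draft tokens are accepted per verification pass, which is why draft quality matters.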

Taxonomy

Topics: Natural Language Processing Techniques