Adaptive Draft-Verification for Efficient Large Language Model Decoding
Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu

TL;DR
This paper introduces ADED, an adaptive draft-verification method that accelerates large language model decoding without fine-tuning, by dynamically approximating output distributions and balancing exploration and exploitation.
Contribution
The paper presents a novel adaptive draft-verification approach for LLM decoding that improves efficiency without requiring model fine-tuning or fixed retrieval schemes.
Findings
ADED significantly speeds up decoding across various benchmarks.
The method maintains high accuracy comparable to standard decoding.
It adapts to changing token probabilities during generation.
Abstract
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires a separate forward pass through the model for each token generated, which is computationally inefficient and poses challenges for deploying LLMs in latency-sensitive scenarios. The main limitations of current decoding methods stem from their inefficiencies and resource demands. Existing approaches either necessitate fine-tuning smaller models, which is resource-intensive, or rely on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across different models and contexts. To address these issues, we introduce a novel methodology called ADED, which accelerates LLM decoding without requiring…
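To make the cost argument concrete, here is a minimal sketch of a generic draft-verification loop next to plain autoregressive decoding. It is not ADED itself: the adaptive draft construction is not described in the text above, so the sketch substitutes a fixed toy proposer and simple greedy verification. The functions `target_next` and `draft_next` are hypothetical stand-ins for an LLM and a cheap draft source, and a "pass" is simulated rather than run as a real batched forward pass.

```python
from typing import List, Sequence, Tuple

VOCAB_SIZE = 7  # toy vocabulary, assumption for illustration only


def target_next(context: Sequence[int]) -> int:
    """Hypothetical stand-in for the LLM's greedy next-token choice."""
    return (sum(context) * 31 + len(context)) % VOCAB_SIZE


def draft_next(context: Sequence[int]) -> int:
    """Hypothetical cheap proposer: agrees with the target except when
    the context length is a multiple of 3."""
    guess = target_next(context)
    return guess if len(context) % 3 else (guess + 1) % VOCAB_SIZE


def autoregressive_decode(prompt: Sequence[int], n_new: int) -> Tuple[List[int], int]:
    """Baseline: one target-model pass per generated token."""
    tokens, passes = list(prompt), 0
    for _ in range(n_new):
        tokens.append(target_next(tokens))
        passes += 1
    return tokens, passes


def draft_verify_decode(
    prompt: Sequence[int], n_new: int, draft_len: int = 4
) -> Tuple[List[int], int]:
    """Draft several tokens cheaply, then check them in one simulated pass."""
    tokens, passes = list(prompt), 0
    target_len = len(prompt) + n_new
    while len(tokens) < target_len:
        # 1. Draft: propose draft_len tokens without calling the target model.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. Verify: a real system scores every draft position in a single
        #    batched forward pass; here that pass is simulated position by position.
        passes += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Whole draft accepted: the same pass also yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len], passes


if __name__ == "__main__":
    slow, slow_passes = autoregressive_decode([1, 2, 3], 12)
    fast, fast_passes = draft_verify_decode([1, 2, 3], 12)
    print("identical output:", fast == slow)
    print("target passes:", slow_passes, "vs", fast_passes)
```

Because every kept token is one the target model would have chosen greedily, the output matches the baseline while the number of target passes drops whenever the proposer guesses correctly. The speedup therefore depends entirely on draft quality, which is the part a fixed proposer like this one cannot adapt.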
Taxonomy
Topics: Natural Language Processing Techniques
