AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability

Sudhanshu Agrawal; Wonseok Jeon; Mingu Lee

arXiv:2410.18351·cs.CL·October 25, 2024

Open Access

TL;DR

AdaEDL introduces an entropy-based early stopping method for speculative decoding in large language models, significantly improving inference efficiency without additional training.

Contribution

It proposes a training-free, adaptive draft stopping criterion based on entropy, outperforming static and other dynamic methods across various datasets.

Findings

01. Outperforms static draft-length methods by 10%-57%.

02. More robust at high sampling temperatures.

03. Seamlessly integrates into existing LLM systems.

Abstract

Speculative decoding is a powerful technique that attempts to circumvent the autoregressive constraint of modern Large Language Models (LLMs). The aim of speculative decoding techniques is to improve the average inference time of a large target model without sacrificing its accuracy, by using a more efficient draft model to propose draft tokens, which are then verified in parallel. The number of draft tokens produced in each drafting round is referred to as the draft length and is often a static hyperparameter chosen based on the acceptance rate statistics of the draft tokens. However, setting a static draft length can negatively impact performance, especially in scenarios where drafting is expensive and there is a high variance in the number of tokens accepted. Adaptive Entropy-based Draft Length (AdaEDL) is a simple, training-free and parameter-free criterion which allows for early stopping…
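To make the idea of entropy-based early draft stopping concrete, here is a minimal sketch in Python. It is not the authors' implementation. The abstract above is truncated before it states the exact criterion, so the bound form `1 - sqrt(gamma * H)`, the constant `gamma`, the `threshold` value, and all function names below are assumptions chosen for illustration. The draft model is replaced by a plain list of next-token distributions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def acceptance_lower_bound(probs, gamma=0.5):
    """Hypothetical entropy-based lower bound on token acceptance probability.

    ASSUMPTION: the form 1 - sqrt(gamma * H) and the value of gamma are
    illustrative; the excerpt above does not state the bound.
    """
    return max(0.0, 1.0 - math.sqrt(gamma * entropy(probs)))


def draft_with_early_stop(next_token_dists, max_draft_len,
                          threshold=0.4, gamma=0.5):
    """Greedily draft tokens, stopping early when the draft model is uncertain.

    next_token_dists: one probability list per draft step (a stand-in for
    calling a real draft model). ASSUMPTION: threshold is a fixed
    illustrative value.
    """
    drafted = []
    for step, probs in enumerate(next_token_dists):
        if step >= max_draft_len:
            break  # static cap on the draft length
        if acceptance_lower_bound(probs, gamma) < threshold:
            break  # high entropy: token is unlikely to be accepted
        # Greedy pick: index of the most probable token.
        drafted.append(max(range(len(probs)), key=probs.__getitem__))
    return drafted


dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident draft step
    [0.90, 0.05, 0.03, 0.02],  # still fairly confident
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximum entropy, drafting stops
    [0.99, 0.01, 0.00, 0.00],  # never reached
]
tokens = draft_with_early_stop(dists, max_draft_len=5)
```

In this sketch the drafting round ends at the third step, so only two tokens would be sent to the target model for verification, rather than a fixed five.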

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Early Stopping