AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of Adaptive Draft Structures

Situo Zhang; Hankun Wang; Da Ma; Zichen Zhu; Lu Chen; Kunyao Lan; Kai Yu

arXiv:2412.18910·cs.AI·December 30, 2024

TL;DR

AdaEAGLE is a speculative decoding framework that explicitly models adaptive draft structures. By predicting the optimal draft length without manual thresholds, it speeds up large language model inference.

Contribution

The paper introduces AdaEAGLE, the first framework to explicitly model adaptive draft structures using a lightweight predictor, enabling deeper optimization and better speedup than existing methods.

Findings

1. Achieves a 1.62x speedup over vanilla autoregressive decoding.
2. Outperforms fixed-length state-of-the-art baselines.
3. Maintains output quality while accelerating inference.

Abstract

Speculative Decoding (SD) is a popular lossless technique for accelerating the inference of Large Language Models (LLMs). We show that the decoding speed of SD frameworks with static draft structures can be significantly improved by incorporating context-aware adaptive draft structures. However, current studies on adaptive draft structures are limited by their performance, modeling approaches, and applicability. In this paper, we introduce AdaEAGLE, the first SD framework that explicitly models adaptive draft structures. AdaEAGLE leverages the Lightweight Draft Length Predictor (LDLP) module to explicitly predict the optimal number of draft tokens during inference to guide the draft model. It achieves comparable speedup results without manual thresholds and allows for deeper, more specialized optimizations. Moreover, together with threshold-based strategies, AdaEAGLE achieves a…
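The idea of letting a predictor choose the draft length before each draft-then-verify step can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes greedy decoding, uses toy integer "models", and every name here (`speculative_step`, `generate`, `toy_target`, `toy_draft`, `oracle_len`) is hypothetical. The length predictor is a hand-written stand-in for a learned module such as LDLP.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]
LengthPredictor = Callable[[List[Token]], int]


def speculative_step(context: List[Token], target_next: NextToken,
                     draft_next: NextToken,
                     predict_len: LengthPredictor) -> List[Token]:
    """One greedy draft-then-verify step with a predicted draft length."""
    k = max(0, predict_len(context))  # adaptive number of draft tokens
    # Draft phase: the cheap model proposes k tokens autoregressively.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))
    # Verify phase: a real system scores all positions in one target forward
    # pass; this toy version queries the target once per position instead.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt: List[Token], n_new: int, target_next: NextToken,
             draft_next: NextToken,
             predict_len: LengthPredictor) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also return the number of verify steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(out, target_next, draft_next, predict_len))
        steps += 1
    return out[:len(prompt) + n_new], steps


# Toy models (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except after tokens 3 and 7.
    return 0 if ctx[-1] % 4 == 3 else toy_target(ctx)


def fixed_len(ctx: List[Token]) -> int:
    return 3  # static draft structure


def oracle_len(ctx: List[Token]) -> int:
    # Stand-in for a learned predictor: drafts only as far as the toy
    # draft model will stay correct from the current position.
    return 3 - ctx[-1] % 4


baseline = [0]
for _ in range(12):
    baseline.append(toy_target(baseline))  # plain autoregressive decoding

fixed_out, fixed_steps = generate([0], 12, toy_target, toy_draft, fixed_len)
adaptive_out, adaptive_steps = generate([0], 12, toy_target, toy_draft, oracle_len)
```

Under greedy verification both variants reproduce the plain autoregressive output, which is the lossless property the abstract refers to. The draft length only changes how much draft work is done per verify step, which is where a context-aware predictor can save time.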


Taxonomy

Topics: Natural Language Processing Techniques