EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

Yuhui Li; Fangyun Wei; Chao Zhang; Hongyang Zhang

arXiv:2406.16858 · cs.CL · July 2, 2024 · 1 citation

TL;DR

EAGLE-2 improves speculative sampling for large language models by introducing a context-aware dynamic draft tree, achieving speedup ratios of 3.05x-4.26x while leaving the output distribution unchanged.

Contribution

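The paper proposes a context-aware dynamic draft tree for speculative sampling. It replaces the static draft trees used by earlier methods such as EAGLE, which implicitly assume that the acceptance rate of a draft token depends only on its position.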

Findings

1. Achieves speedup ratios of 3.05x-4.26x
2. Runs 20%-40% faster than EAGLE-1
3. Leaves the output distribution unchanged

Abstract

Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a new technique of context-aware dynamic draft tree into drafting modeling. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios 3.05x-4.26x, which is 20%-40% faster than EAGLE-1.…
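The abstract's central idea, that draft-model confidence approximates acceptance rate and can therefore decide where the draft tree grows, can be illustrated with a minimal Python sketch. It assumes a two-phase construction (expand the most promising nodes at each depth, then rerank all candidates and keep a fixed token budget) and scores each node by the product of confidences along its path. The truncated abstract above does not spell out these details, so treat them as assumptions. The function `build_dynamic_draft_tree`, its parameters, and the `toy_draft` model are hypothetical names chosen for illustration, not the API of the official repository.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Path = Tuple[str, ...]


def build_dynamic_draft_tree(
    draft_probs: Callable[[Path], Dict[str, float]],  # hypothetical draft-model interface
    prefix: Path,
    depth: int,
    top_k: int,
    total_tokens: int,
) -> List[Tuple[Path, float]]:
    """Select draft paths by estimated acceptance (product of confidences)."""
    frontier: List[Tuple[Path, float]] = [((), 1.0)]  # root: the verified prefix
    candidates: Dict[Path, float] = {}

    for _ in range(depth):
        # Expand: score every child of the current frontier by its path confidence.
        children = [
            (path + (tok,), value * p)
            for path, value in frontier
            for tok, p in draft_probs(prefix + path).items()
        ]
        if not children:
            break
        candidates.update(children)
        # Only the top_k most promising children are expanded at the next depth.
        frontier = heapq.nlargest(top_k, children, key=lambda c: c[1])

    # Rerank: keep the best total_tokens candidates overall. A child never
    # scores above its parent, and ties favor shallower nodes, so the
    # selection always forms a connected tree.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0]))
    return ranked[:total_tokens]


def toy_draft(context: Path) -> Dict[str, float]:
    """Toy stand-in for a draft model: confidence depends on the last token."""
    last = context[-1] if context else ""
    if last == "the":
        return {"cat": 0.9, "dog": 0.1}
    if last == "cat":
        return {"sat": 0.8, "ran": 0.2}
    return {"a": 0.5, "b": 0.5}


selected = build_dynamic_draft_tree(
    toy_draft, prefix=("the",), depth=2, top_k=2, total_tokens=4
)
for path, confidence in selected:
    print(path, round(confidence, 3))
```

In this toy run the budget is spent mostly beneath the high-confidence token "cat", while the low-confidence branch "dog" is kept but not extended. A static tree would give both branches the same shape regardless of context, which is the assumption the abstract argues against.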

Code & Models

Repositories

safeailab/eagle (PyTorch, official)

Videos

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling