Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Zachary Ankner; Rishab Parthasarathy; Aniruddha Nrusimha; Christopher Rinard; Jonathan Ragan-Kelley; William Brandon

arXiv:2402.05109 · cs.LG · October 8, 2024 · 1 citation

TL;DR

Hydra heads are a sequentially-dependent draft head design for Medusa decoding that improves the accuracy of draft speculation and raises decoding throughput in large language models.

Contribution

Introduction of Hydra heads, a sequentially-dependent draft head architecture, and Hydra++, a tuned recipe that improves decoding speed and accuracy over existing methods.
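The difference between sequentially-independent and sequentially-dependent draft heads can be illustrated with a toy data-flow sketch. Everything here is hypothetical and is not taken from the paper's code: the function names, the two-token "world" in `FOLLOWS`, and the hand-written heads are stand-ins chosen only to show what each kind of head is allowed to see.

```python
from typing import Callable, List, Sequence, Tuple

# Stand-in for the base model's final hidden state (assumption: a tuple of floats).
Hidden = Tuple[float, ...]


def draft_independent(hidden: Hidden, heads: Sequence[Callable]) -> List[int]:
    """Each head sees only the hidden state, never the other heads' tokens."""
    return [head(hidden) for head in heads]


def draft_dependent(hidden: Hidden, heads: Sequence[Callable]) -> List[int]:
    """Each head also sees the tokens drafted so far in the candidate."""
    drafted: List[int] = []
    for head in heads:
        drafted.append(head(hidden, tuple(drafted)))
    return drafted


# Hypothetical toy world: the true continuation is either (1, 2) or (3, 4).
FOLLOWS = {1: 2, 3: 4}


def first_head(hidden: Hidden, prefix: Tuple[int, ...] = ()) -> int:
    # Picks the first token from the hidden state alone.
    return 1 if hidden[0] >= 0 else 3


def second_head_independent(hidden: Hidden) -> int:
    # Cannot see what the first head drafted, so it returns one fixed guess.
    return 4


def second_head_dependent(hidden: Hidden, prefix: Tuple[int, ...]) -> int:
    # Conditions on the previously drafted token.
    return FOLLOWS[prefix[-1]]


hidden = (0.5,)
print(draft_independent(hidden, [first_head, second_head_independent]))  # [1, 4]
print(draft_dependent(hidden, [first_head, second_head_dependent]))      # [1, 2]
```

In this toy setting the independent pair can emit the incoherent candidate `[1, 4]`, while the dependent pair stays consistent with its own first token. Real draft heads are learned layers over hidden states rather than lookup tables, so this shows only the dependency structure.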

Findings

1. Hydra heads improve draft head speculation accuracy.

2. Hydra++ increases decoding throughput by up to 2.70x over autoregressive decoding.

3. The proposed method enhances end-to-end speed of speculative decoding.

Abstract

To combat the memory bandwidth-bound nature of autoregressive LLM inference, previous research has proposed the speculative decoding framework. To perform speculative decoding, a small draft model proposes candidate continuations of the input sequence that are then verified in parallel by the base model. One way to specify the draft model, as used in the recent Medusa decoding framework, is as a collection of lightweight heads, called draft heads, that operate on the base model's hidden states. To date, all existing draft heads have been sequentially independent, meaning that they speculate tokens in the candidate continuation independently of any preceding tokens in the candidate continuation. In this work, we propose Hydra heads: a sequentially-dependent drop-in replacement for standard draft heads that significantly improves the accuracy of draft head speculation. We further explore…
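The draft-then-verify step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a greedy acceptance rule (keep drafted tokens while they match the base model's own next-token choice), and `verify_greedy`, `base_next`, and `toy_base_next` are hypothetical names. A real system would score all candidate positions in one parallel forward pass; the loop below calls the base model once per position only for readability.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    context: Sequence[int],
    candidate: Sequence[int],
    base_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens to append after checking a drafted candidate.

    Assumed rule: accept drafted tokens while they equal the base model's
    greedy next token; at the first mismatch, emit the base model's token
    instead and stop. If every drafted token is accepted, append one extra
    token from the base model.
    """
    accepted: List[int] = []
    for token in candidate:
        expected = base_next(list(context) + accepted)
        if token != expected:
            accepted.append(expected)
            return accepted
        accepted.append(token)
    accepted.append(base_next(list(context) + accepted))
    return accepted


def toy_base_next(tokens: List[int]) -> int:
    # Hypothetical stand-in for the base model: always counts upward mod 10.
    return (tokens[-1] + 1) % 10


print(verify_greedy([1, 2, 3], [4, 5, 7, 8], toy_base_next))  # [4, 5, 6]
print(verify_greedy([1, 2, 3], [4, 5], toy_base_next))        # [4, 5, 6]
```

Under this rule every verification step yields at least one token, and a more accurate draft yields longer accepted runs per step, which is why draft head accuracy bears directly on throughput.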

Code & Models

Repositories

zankner/hydra (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Classical Antiquity Studies · Topic Modeling

Methods: Balanced Selection · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Hydra