Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, William Brandon

TL;DR
Hydra heads are a new sequentially-dependent draft head design for Medusa decoding that enhances speculative decoding accuracy and significantly boosts decoding throughput in large language models.
Contribution
Introduction of Hydra heads, a sequentially-dependent draft head architecture, and Hydra++, a tuned recipe that improves decoding speed and accuracy over existing methods.
Findings
Hydra heads improve draft head speculation accuracy.
Hydra++ increases decoding throughput by up to 2.70x.
The proposed method enhances end-to-end speed of speculative decoding.
Abstract
To combat the memory bandwidth-bound nature of autoregressive LLM inference, previous research has proposed the speculative decoding framework. To perform speculative decoding, a small draft model proposes candidate continuations of the input sequence that are then verified in parallel by the base model. One way to specify the draft model, as used in the recent Medusa decoding framework, is as a collection of lightweight heads, called draft heads, that operate on the base model's hidden states. To date, all existing draft heads have been sequentially independent, meaning that they speculate tokens in the candidate continuation independently of any preceding tokens in the candidate continuation. In this work, we propose Hydra heads: a sequentially-dependent drop-in replacement for standard draft heads that significantly improves the accuracy of draft head speculation. We further explore…
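The distinction the abstract draws between sequentially independent and sequentially dependent draft heads can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the integer "hidden state", the hand-written toy heads, the toy base model rule, and the function names are all hypothetical. The toy heads are deliberately chosen so that the difference is visible; the sketch says nothing about the accuracy or speedups measured in the paper.

```python
from typing import Callable, List, Sequence, Tuple

# Toy stand-ins (assumptions): a hidden state and a token are both small integers.
Hidden = int
Token = int


def toy_base_next(context: Sequence[Token]) -> Token:
    """Hypothetical base model: the next token depends only on the last token."""
    return (context[-1] * 3 + 1) % 10


def make_independent_head(k: int) -> Callable[[Hidden], Token]:
    """Toy head for position k that sees ONLY the hidden state."""
    return lambda hidden: (hidden + 3 * (k + 1)) % 10


def toy_dependent_head(hidden: Hidden, drafted: Tuple[Token, ...]) -> Token:
    """Toy head that also sees the tokens drafted so far in this continuation."""
    last = drafted[-1] if drafted else hidden
    return (last * 3 + 1) % 10


def draft_independent(hidden: Hidden, heads) -> List[Token]:
    # Each head speculates its position without looking at earlier draft tokens.
    return [head(hidden) for head in heads]


def draft_dependent(hidden: Hidden, heads) -> List[Token]:
    # Each head is conditioned on the tokens speculated by the preceding heads.
    tokens: List[Token] = []
    for head in heads:
        tokens.append(head(hidden, tuple(tokens)))
    return tokens


def accepted_prefix_length(draft, base_next, context) -> int:
    """Greedy verification: accept draft tokens until the first mismatch.

    A real system checks all positions in one parallel base-model pass;
    this sequential loop computes the same accepted length.
    """
    ctx = list(context)
    accepted = 0
    for token in draft:
        if base_next(ctx) != token:
            break
        ctx.append(token)
        accepted += 1
    return accepted


INDEPENDENT_HEADS = [make_independent_head(k) for k in range(3)]
DEPENDENT_HEADS = [toy_dependent_head] * 3

if __name__ == "__main__":
    context = [1]          # toy prompt
    hidden = context[-1]   # toy hidden state: just the last token
    for name, draft in [
        ("independent", draft_independent(hidden, INDEPENDENT_HEADS)),
        ("dependent", draft_dependent(hidden, DEPENDENT_HEADS)),
    ]:
        n = accepted_prefix_length(draft, toy_base_next, context)
        print(name, draft, "accepted:", n)
```

In this toy setup the independent heads draft `[4, 7, 0]` and only the first token is accepted, while the dependent heads draft `[4, 3, 0]` and all three are accepted. The only point is structural: a head that conditions on earlier draft tokens can track a base model whose next token depends on the previous one, whereas a head that sees only the hidden state cannot.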
Taxonomy
Topics: Natural Language Processing Techniques · Classical Antiquity Studies · Topic Modeling
Methods: Balanced Selection · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Hydra
