Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity
Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, Ivan Kobyzev

TL;DR
This paper introduces a lightweight, adaptive self-speculative decoding method for large language models that generates varying draft models on the fly without fine-tuning, achieving competitive inference speed.
Contribution
It proposes a novel, rule-based approach for adaptive draft model generation that is simple, plug-and-play, and does not require additional training or optimization.
Findings
Competitive with state-of-the-art self-speculative decoding methods
Does not require fine-tuning or black-box optimization
Simple and truly plug-and-play implementation
Abstract
We present a simple on-the-fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model, relying instead on simple rules to generate varying draft models adapted to the input context. We show empirically that our lightweight algorithm is competitive with the current SOTA for self-speculative decoding, while being a truly plug-and-play method.
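The abstract does not spell out the "simple rules", but the title points to cosine similarity as the signal. The following is a minimal sketch of one plausible rule, assuming that a layer whose output hidden state is nearly parallel to its input contributes little for the current context and can be dropped from the draft model. The function names, the threshold value, and the toy vectors are hypothetical illustrations, not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_layers_to_skip(layer_inputs, layer_outputs, threshold=0.99):
    """Return indices of layers whose output is almost parallel to its input.

    Assumption (not from the paper): such layers barely change the hidden
    state, so the draft model may skip them. `threshold` is a hypothetical
    tuning knob.
    """
    skip = []
    for idx, (x, y) in enumerate(zip(layer_inputs, layer_outputs)):
        if cosine_similarity(x, y) >= threshold:
            skip.append(idx)
    return skip


# Toy hidden states for three layers (made-up numbers for illustration only).
layer_inputs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
layer_outputs = [[1.0, 0.01], [1.0, -1.0], [0.0, 2.0]]

skipped = select_layers_to_skip(layer_inputs, layer_outputs)
print(skipped)
```

Because the similarities are recomputed from the current hidden states, the set of skipped layers can differ from one context to the next, which is the sense in which the draft model varies. In a full self-speculative decoder, the tokens proposed by the reduced model would still be verified by the complete model.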
Taxonomy
Topics: Topic Modeling · Algorithms and Data Compression · Natural Language Processing Techniques
