Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity

Michael R. Metel; Peng Lu; Boxing Chen; Mehdi Rezagholizadeh; Ivan Kobyzev

arXiv:2410.01028·cs.CL·October 3, 2024

Open Access

TL;DR

This paper introduces a lightweight, adaptive self-speculative decoding method for large language models that generates varying draft models on the fly without fine-tuning, achieving competitive inference speed.

Contribution

It proposes a novel, rule-based approach for adaptive draft model generation that is simple, plug-and-play, and does not require additional training or optimization.

Findings

01. Competitive with state-of-the-art self-speculative decoding methods

02. Does not require fine-tuning or black-box optimization

03. Simple and truly plug-and-play implementation

Abstract

We present a simple on-the-fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model, relying instead on simple rules to generate varying draft models adapted to the input context. We show empirically that our lightweight algorithm is competitive with the current SOTA for self-speculative decoding, while being a truly plug-and-play method.
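
The abstract does not spell out the "simple rules", so the following is only an illustrative sketch of one plausible reading of the title: score each layer by the cosine similarity between its input and output hidden states, and skip the layers that change the hidden state least when forming a draft model. The function names, the fixed `num_skip` budget, and the toy vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_layers_to_skip(
    layer_inputs: Sequence[Sequence[float]],
    layer_outputs: Sequence[Sequence[float]],
    num_skip: int,
) -> List[int]:
    """Hypothetical rule: skip the num_skip layers whose output is most
    similar to their input, i.e. the layers that alter the hidden state least."""
    sims = [cosine_similarity(i, o) for i, o in zip(layer_inputs, layer_outputs)]
    ranked = sorted(range(len(sims)), key=lambda idx: sims[idx], reverse=True)
    return sorted(ranked[:num_skip])


# Toy hidden states for a 3-layer model (made-up numbers, not real activations).
layer_inputs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
layer_outputs = [[0.0, 1.0], [2.0, 2.0], [0.1, 1.0]]

skip = select_layers_to_skip(layer_inputs, layer_outputs, num_skip=2)
print(skip)
```

In this toy case, layer 0 rotates its input by 90 degrees and is kept, while layers 1 and 2 barely change direction and are chosen for skipping. Because the scores come from the current hidden states, the selected set can differ from one input context to another, which is the sense in which such a draft model would be "adaptive".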

Taxonomy

Topics: Topic Modeling · Algorithms and Data Compression · Natural Language Processing Techniques