FastDraft: How to Train Your Draft

Ofir Zafrir; Igor Margulis; Dorin Shteyman; Shira Guskin; Guy Boudoukh

arXiv:2411.11055·cs.CL·June 6, 2025


Open Access · 4 Models

TL;DR

FastDraft introduces an efficient pre-training and alignment method for draft models, enabling faster large language model inference on edge devices with minimal resources and time.

Contribution

The paper presents a novel FastDraft approach that efficiently pre-trains and aligns draft models to any large language model, reducing training time and resource requirements.

Findings

01 Draft models achieve up to 3x speedup in code completion

02 FastDraft training completes in under 24 hours on a single server

03 Benchmarking shows up to 2x wall-clock time speedup

Abstract

Speculative Decoding has gained popularity as an effective technique for accelerating the auto-regressive inference process of Large Language Models. However, Speculative Decoding entirely relies on the availability of efficient draft models, which are often lacking for many existing language models due to a stringent constraint of vocabulary compatibility. In this work we introduce FastDraft, a novel and efficient approach for pre-training and aligning a draft model to any large language model by incorporating efficient pre-training, followed by fine-tuning over synthetic datasets generated by the target model. We demonstrate FastDraft by training two highly parameter efficient drafts for the popular Phi-3-mini and Llama-3.1-8B models. Using FastDraft, we were able to produce a draft model with approximately 10 billion tokens on a single server with 8 Intel® …
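To make the role of the draft model concrete, here is a minimal sketch of greedy speculative decoding. It is not the FastDraft training recipe or the authors' code. The two "models" are hypothetical toy functions over integer token ids, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are invented for illustration. The sketch assumes that draft and target share one vocabulary, which is the compatibility constraint the abstract mentions, and that verification is a simple greedy match.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[Token], k: int) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the longest run the target agrees with.

    Returns (tokens to append, number of draft tokens accepted).
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal. A real system scores all k
    #    positions in one batched forward pass; this loop only mimics it.
    out: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            out.append(expected)  # replace the first mismatch and stop
            return out, len(out) - 1
        out.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target adds one extra token.
    out.append(target(ctx))
    return out, k


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             max_new: int, k: int = 4) -> Tuple[List[Token], float]:
    """Decode max_new tokens; also return the fraction of drafts accepted."""
    out = list(prompt)
    accepted = drafted = 0
    while len(out) - len(prompt) < max_new:
        new, n_ok = speculative_step(target, draft, out, k)
        out.extend(new)
        accepted += n_ok
        drafted += k
    return out[:len(prompt) + max_new], accepted / drafted


# Hypothetical toy models: the target counts modulo 10, and the draft
# agrees with it except after token 7.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


tokens, accept_rate = generate(toy_target, toy_draft, [0], max_new=20)
```

With greedy verification the output is identical to decoding with the target alone, so the draft affects only speed. The better the draft matches the target, the more tokens each target pass yields.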


Taxonomy

Topics: Natural Language Processing Techniques · Artificial Intelligence in Law

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings