OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Ramchalam Kinattinkara Ramakrishnan; Zhaocong Yuan; Shaojie Zhuo; Chen Feng; Yicheng Lin; Chenzheng Su; Xiaopeng Zhang

arXiv:2507.02659 · cs.LG · October 15, 2025

TL;DR

OmniDraft is a versatile, online adaptive drafting framework that allows a single draft model to work efficiently with various target models, improving decoding speed and customization for on-device large language model applications.

Contribution

It introduces an online n-gram cache with hybrid distillation to enable cross-vocabulary compatibility and dynamic adaptation in a unified draft model framework.
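To make the cross-vocabulary idea concrete, the following is a minimal sketch of one way an online n-gram cache could bridge two tokenizers. It assumes the cache maps a run of draft-model tokens to the single target-model token that spells the same string. The class `NGramCache`, the helper `greedy_tokenize`, and the toy vocabulary are hypothetical names chosen for illustration; they are not taken from the paper, and the hybrid distillation step is not modeled here.

```python
from typing import Dict, List, Set, Tuple


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Hypothetical draft tokenizer: greedy longest-match over a toy vocabulary."""
    pieces: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r} with the toy vocabulary")
    return pieces


class NGramCache:
    """Toy cache: a tuple of draft tokens -> one target token (illustrative only)."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[str, ...], str] = {}
        self._max_len = 1

    def observe(self, target_token: str, draft_vocab: Set[str]) -> None:
        """Record a mapping when a target token splits into several draft tokens."""
        pieces = greedy_tokenize(target_token, draft_vocab)
        if len(pieces) > 1:
            self._table[tuple(pieces)] = target_token
            self._max_len = max(self._max_len, len(pieces))

    def translate(self, draft_tokens: List[str]) -> List[str]:
        """Merge cached draft n-grams into target tokens, longest match first."""
        out: List[str] = []
        i = 0
        while i < len(draft_tokens):
            longest = min(self._max_len, len(draft_tokens) - i)
            for n in range(longest, 0, -1):
                key = tuple(draft_tokens[i:i + n])
                if key in self._table:
                    out.append(self._table[key])
                    i += n
                    break
            else:
                # No cached n-gram starts here; pass the draft token through.
                out.append(draft_tokens[i])
                i += 1
        return out


# Toy data (assumed): the target tokenizer emits whole words, the draft one emits pieces.
draft_vocab = {"spec", "ul", "ative", " ", "de", "coding"}
cache = NGramCache()
for target_token in ["speculative", " ", "decoding"]:
    cache.observe(target_token, draft_vocab)

proposal = cache.translate(["spec", "ul", "ative", " ", "de", "coding"])
```

In this sketch the cache grows as target tokens are observed during use, so a draft proposal expressed in the draft vocabulary can be re-expressed in target tokens before verification. A real system would operate on token IDs and probabilities rather than strings.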

Findings

01. Enables a single Llama-68M to pair with multiple target models.

02. Achieves up to 1.5-2x speedup in decoding.

03. Demonstrates effectiveness on math, coding, and text generation tasks.

Abstract

Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user…
