VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Raghavv Goel; Sudhanshu Agrawal; Mukul Gagrani; Junyoung Park; Yifan Zao; He Zhang; Tian Liu; Yiping Yang; Xin Yuan; Jiuyan Lu; Chris Lott; Mingu Lee

arXiv:2506.22694·cs.CL·September 30, 2025

TL;DR

VocabTrim is a training-free vocabulary pruning technique that reduces speculative decoding overhead in large language models, significantly improving generation speed especially on memory-bound devices.

Contribution

VocabTrim introduces a simple method to prune the draft model's vocabulary based on target model sampling, enhancing decoding efficiency without retraining.
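As a rough illustration of the idea, the sketch below counts how often each token ID appears in text sampled from the target model, keeps the most frequent IDs, and slices a toy LM head down to those rows. The function names (`build_trimmed_vocab`, `trim_lm_head`, `draft_argmax`), the top-k selection rule, and the toy data are assumptions made for illustration; they are not taken from the paper, whose exact selection procedure is not described on this page.

```python
from collections import Counter


def build_trimmed_vocab(target_token_ids, keep_k):
    """Return the keep_k most frequent token IDs, sorted ascending.

    target_token_ids: token IDs from text sampled from the target model
    (hypothetical calibration data).
    """
    counts = Counter(target_token_ids)
    top = [tok for tok, _ in counts.most_common(keep_k)]
    return sorted(top)


def trim_lm_head(lm_head_rows, kept_ids):
    """Keep only the LM-head rows (one row per token) for retained IDs."""
    return [lm_head_rows[i] for i in kept_ids]


def draft_argmax(hidden, trimmed_head, kept_ids):
    """Greedy draft step over the trimmed vocabulary.

    Returns a token ID in the ORIGINAL vocabulary so the target model
    can verify it without any change on its side.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in trimmed_head]
    best = max(range(len(logits)), key=logits.__getitem__)
    return kept_ids[best]


# Toy setup: full vocabulary of 8 tokens, hidden size 2.
calibration_ids = [5, 2, 5, 7, 2, 5, 0, 7, 5]
full_head = [[float(i), 1.0] for i in range(8)]

kept = build_trimmed_vocab(calibration_ids, keep_k=3)
trimmed = trim_lm_head(full_head, kept)
drafted = draft_argmax([1.0, 0.0], trimmed, kept)
```

In this toy setup the drafter's LM head shrinks from 8 rows to 3, which is the source of the saving: the final projection reads fewer weights per drafted token. The trade-off is that any token outside the retained set can never be drafted.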

Findings

01. Achieves 16% memory-bound speed-up for Llama-3.2-3B-Instruct
02. Reduces drafting latency significantly in memory-bound environments
03. Maintains acceptable acceptance rates despite vocabulary pruning

Abstract

In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, which is then verified by a base LLM, the target model, that accepts a subset as its valid generation. Because speculative decoding is usually assumed to require a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even to share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for some…
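The draft-then-verify loop described in the abstract can be sketched as follows. This is a simplified greedy variant written for illustration only: the function `verify_greedy`, the callback `target_next_token`, and the toy target model are hypothetical, and a real implementation would typically score all draft positions in a single target forward pass and may use sampling-based acceptance rather than exact matching.

```python
def verify_greedy(prefix, draft_tokens, target_next_token):
    """Accept draft tokens while they match the target's greedy choice.

    prefix: tokens generated so far.
    draft_tokens: tokens proposed by the draft model.
    target_next_token: callable mapping a token sequence to the target
        model's next token (a stand-in for a real target model).
    Returns the tokens to append: the accepted draft tokens followed by
    one token chosen by the target itself.
    """
    seq = list(prefix)
    out = []
    for tok in draft_tokens:
        expected = target_next_token(seq)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            out.append(expected)
            return out
        out.append(tok)
        seq.append(tok)
    # Every draft token was accepted; the target adds one more token.
    out.append(target_next_token(seq))
    return out


def toy_target(seq):
    """Toy target model: the next token is (last token + 1) mod 10."""
    return (seq[-1] + 1) % 10


partial = verify_greedy([3], [4, 5, 9], toy_target)
full = verify_greedy([3], [4, 5, 6], toy_target)
```

Because the target only compares token IDs, a drafter restricted to a pruned vocabulary fits into this loop unchanged as long as its outputs are mapped back to the target's token IDs. Pruning can only affect how many draft tokens are accepted, not which tokens the target finally emits.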

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems