FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

Weilin Zhao; Tengyu Pan; Xu Han; Yudi Zhang; Ao Sun; Yuxiang Huang; Kaihuo Zhang; Weilun Zhao; Yuxuan Li; Jianyong Wang; Zhiyuan Liu; Maosong Sun

arXiv:2502.14856 · cs.CL · March 12, 2025

TL;DR

FR-Spec is a frequency-ranked speculative sampling method that restricts the draft search to a subset of high-frequency tokens, cutting LM Head computation by 75% and accelerating generation for large-vocabulary language models.

Contribution

The paper proposes a vocabulary-space compression technique for speculative sampling that improves drafting efficiency for large-vocabulary LLMs while keeping the final output distribution equivalent.

Findings

01. 75% reduction in LM Head computation

02. Average 1.12× speedup over EAGLE-2

03. Maintains output distribution equivalence
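
The third finding can be illustrated with the generic speculative-sampling acceptance rule. This page does not describe FR-Spec's own verification procedure, so the sketch below is an assumption: it uses the standard single-token accept/reject rule, and the names `verify_draft_token` and `emitted_distribution` are hypothetical. It shows that the emitted distribution still matches the target distribution `p` when the draft distribution `q` puts zero mass outside a restricted token subset.

```python
import random

def verify_draft_token(x, p, q, rng):
    """Generic speculative-sampling check for one drafted token (illustrative).

    p: target-model probabilities over the full vocabulary.
    q: draft probabilities over the full vocabulary, zero outside the draft subset.
    x: token id that was sampled from q, so q[x] > 0.
    """
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the residual max(0, p - q); random.choices
    # normalizes the weights itself. Tokens outside the draft subset keep
    # all of their target mass here.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]

def emitted_distribution(p, q):
    """Exact distribution of the token emitted by the rule above."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q[x] * min(1, p[x] / q[x])
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p == q, so nothing is ever rejected
        return accept
    return [a + reject_mass * r / total for a, r in zip(accept, residual)]

# Toy 4-token vocabulary; the draft only proposes tokens 0 and 1.
p = [0.5, 0.3, 0.15, 0.05]
q = [0.6, 0.4, 0.0, 0.0]
out = emitted_distribution(p, q)
token = verify_draft_token(0, p, q, random.Random(0))
```

In this toy case `out` agrees with `p` up to floating-point error, including for tokens 2 and 3, which the draft never proposes. Restricting the draft vocabulary can change how often drafts are accepted, but under this rule it does not change what is ultimately sampled.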

Abstract

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution.…
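
As a rough illustration of the vocabulary compression described in the abstract, the sketch below ranks tokens by corpus frequency and computes draft logits only for the top-ranked subset. It is a toy in pure Python, not the authors' implementation. The helper names (`build_frequent_subset`, `draft_logits`), the tiny corpus, and the choice to keep one quarter of the vocabulary (picked to mirror the 75% figure) are all assumptions; the excerpt does not state the actual subset size or the corpus used for ranking.

```python
from collections import Counter

def build_frequent_subset(corpus_token_ids, keep):
    """Return the `keep` most frequent token ids, most frequent first."""
    counts = Counter(corpus_token_ids)
    # Sort by descending count; break ties by token id for determinism.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:keep]

def draft_logits(hidden, lm_head_rows, subset):
    """Compute logits only for tokens in `subset`.

    lm_head_rows[t] is the LM head weight row for token id t.
    The number of dot products equals len(subset), not the vocabulary size.
    """
    return {
        t: sum(h * w for h, w in zip(hidden, lm_head_rows[t]))
        for t in subset
    }

# Toy setup: 8-token vocabulary, hidden size 2 (hypothetical values).
vocab_size = 8
corpus = [0, 1, 1, 2, 2, 2, 5, 5, 5, 5, 7]
subset = build_frequent_subset(corpus, keep=vocab_size // 4)  # keep 25%

lm_head_rows = [[float(t), 1.0] for t in range(vocab_size)]
hidden = [1.0, 2.0]
logits = draft_logits(hidden, lm_head_rows, subset)
best_draft_token = max(logits, key=logits.get)
```

Here the draft step evaluates 2 of the 8 LM head rows, one quarter of the full-vocabulary cost. The target model still verifies drafted tokens against the full vocabulary, which is what allows the final output distribution to stay equivalent.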

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis