Flatter Tokens are More Valuable for Speculative Draft Model Training
Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu, Xu Yang

TL;DR
This paper introduces a data-centric method for improving speculative decoding in large language models by selecting training samples with flatter predictive distributions, leading to faster training with minimal accuracy loss.
Contribution
It proposes a new flatness metric and a dataset distillation method that filters valuable samples, significantly accelerating training for speculative decoding.
Findings
SFDD achieves over 2x training speedup using 50% of the data
Final model inference speed remains within 4% of the full-data baseline
The flatness metric effectively identifies valuable training samples
Abstract
Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. Specifically, our theoretical analysis and empirical validation reveal that tokens inducing flatter predictive distributions from the target model are more valuable than those yielding sharply peaked distributions. Based on this insight, we propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain only the most valuable samples. Experiments on the EAGLE framework demonstrate that SFDD can achieve over 2× training speedup using only 50% of the data,…
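The filtering idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes token-level flatness is the cosine similarity between the target model's predictive distribution and the uniform distribution (the description used in the reviews below), and it assumes a sample's score is the mean over its tokens. The function names `token_flatness`, `sample_flatness`, and `select_flattest` are hypothetical.

```python
import numpy as np


def token_flatness(probs):
    """Cosine similarity between each distribution and the uniform one.

    probs: array of shape (num_tokens, vocab_size), rows sum to 1.
    For uniform u, cos(p, u) = sum(p) / (sqrt(V) * ||p||_2).
    """
    probs = np.asarray(probs, dtype=np.float64)
    vocab_size = probs.shape[-1]
    norms = np.linalg.norm(probs, axis=-1)
    return probs.sum(axis=-1) / (np.sqrt(vocab_size) * norms)


def sample_flatness(probs):
    """Assumed aggregation: mean token flatness over the sample."""
    return float(token_flatness(probs).mean())


def select_flattest(samples, keep_ratio=0.5):
    """Return sorted indices of the flattest `keep_ratio` share of samples."""
    scores = np.array([sample_flatness(s) for s in samples])
    n_keep = max(1, int(round(keep_ratio * len(samples))))
    # Stable sort on negated scores: highest flatness first.
    order = np.argsort(-scores, kind="stable")
    return sorted(order[:n_keep].tolist())


if __name__ == "__main__":
    uniform = np.full((3, 4), 0.25)          # perfectly flat tokens
    one_hot = np.eye(4)[:3]                  # sharply peaked tokens
    mixed = np.array([[0.7, 0.1, 0.1, 0.1]])  # in between
    kept = select_flattest([one_hot, uniform, mixed, one_hot], keep_ratio=0.5)
    print(kept)
```

Under these assumptions the score only needs the target model's output distributions, so it could be computed once offline before draft-model training begins.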
Peer Reviews
Decision·ICLR 2026 Poster
- Novel metric (target model flatness) that could provide better guidance for draft model training.
- Significant data efficiency: achieves similar performance with half the data.
- Shows better results than existing baselines, indicating practical impact.
- Experiments are limited to Llama3.1-8B-Instruct; unclear if results generalize to other model sizes or families.
- Missing discussion of related work on efficient draft model training (e.g., [1] Goel et al., 2024).
- Lack of clarity on training details (epochs, convergence criteria). Some tables are confusing:
  - Table 1 speed-up discrepancies vs. acceptance length raise concerns about hardware or redundancy:
    - For GSM8K, acceptance lengths for No Filter, SFDD, and PPL are 3.28, 2.95, and 2.79
- The paper aims to develop a theoretical underpinning for the data selection for draft model training.
- The proposed method, namely SFDD, strikes a good trade-off between training efficiency and inference speedup across multiple datasets. SFDD outperforms other selection criteria from the literature when selecting 50% of the available data from the draft LM training.
- The paper provides ablation studies that highlight the utility of SFDD as one varies the fraction of data selected for the dr
- The theoretical analysis in the paper is based on the assumption that the underlying distributions are Gaussian, which could be far from the discrete distributions produced by an LM. Did the authors consider working with other distributions, such as the exponential and half-normal distributions?
- The empirical evaluation in the paper is a bit limited. The authors may want to expand their empirical study by exploring more LLMs and training datasets.
- The paper does not provide insights on why other
1. The paper provides a clear, theoretically motivated insight that tokens with flat target distributions are more valuable for SD training, which goes against conventional wisdom in standard KD.
2. The proposed flatness metric is simple, target-model-only, and computable offline. It consistently outperforms multiple baselines across diverse tasks and data retention ratios.
3. SFDD offers a plug-and-play method to significantly reduce training cost for SD without architectural or loss-function c
1. The Gaussian approximation is elegant but may not fully capture the heavy-tailed, sparse nature of real LLM output distributions. The robustness of conclusions to this modeling choice could be better addressed.
2. While flatness is shown to outperform entropy empirically, the paper does not deeply analyze why cosine similarity to uniform is superior to entropy as a flatness proxy, given that both measure distributional spread.
3. Although SFDD reduces the training cost of the draft model, it requir
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Natural Language Processing Techniques
