Learning Harmonized Representations for Speculative Sampling
Lefan Zhang, Xiaodan Wang, Yanhua Huang, Ruiwen Xu

TL;DR
HASS learns harmonized representations for speculative sampling in LLM decoding, addressing the mismatch between training and decoding in both objective and context. It reports 2.81x-4.05x wall-clock speedups on LLaMA models without adding inference overhead.
Contribution
It proposes HArmonized Speculative Sampling (HASS), a method that aligns the draft model's training objective and context with those seen at decoding time, accelerating LLM decoding without extra inference cost.
Findings
Achieves a 2.81x-4.05x wall-clock speedup across four LLaMA models.
Surpasses EAGLE-2 by 8%-20% in speed.
Effective on multiple datasets.
Abstract
Speculative sampling is a promising approach to accelerate the decoding stage for Large Language Models (LLMs). Recent advancements that leverage target LLM's contextual information, such as hidden states and KV cache, have shown significant practical improvements. However, these approaches suffer from inconsistent context between training and decoding. We also observe another discrepancy between the training and decoding objectives in existing speculative sampling methods. In this work, we propose a solution named HArmonized Speculative Sampling (HASS) that learns harmonized representations to address these issues. HASS accelerates the decoding stage without adding inference overhead through harmonized objective distillation and harmonized context alignment. Experiments on four LLaMA models demonstrate that HASS achieves 2.81x-4.05x wall-clock time speedup ratio averaging across three…
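For readers unfamiliar with the baseline, the following is a minimal sketch of the accept/reject step in standard speculative sampling, the general technique the abstract builds on. It is not the HASS training procedure and is not taken from the paper's code. The function name `speculative_accept` and the toy list-based distributions are assumptions made for illustration.

```python
import random


def speculative_accept(p_target, q_draft, draft_token, rng):
    """One accept/reject step of standard speculative sampling.

    p_target, q_draft: probability lists over the same vocabulary
    (hypothetical toy inputs, not real model outputs).
    draft_token: index sampled from q_draft by the draft model.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]  # > 0 because the token was sampled from q_draft

    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True

    # On rejection, resample from the normalized residual max(0, p - q).
    # Rejection implies p < q at draft_token, so the residual mass is positive.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, False
```

Under this rule the emitted token follows the target distribution exactly, so any speedup depends on how often drafted tokens are accepted. That is why the quality of the draft model matters for methods in this family.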
Peer Reviews
Decision·ICLR 2025 Poster
The experiments are well-detailed, with clear metrics for comparison against baselines (e.g., EAGLE and multiple architectures). The method makes sense from the perspective of the typical acceptance process in speculative decoding (SD).
1. While the paper introduces a novel approach, it does not sufficiently explore the construction and utilization of the self-distillation dataset. Isn't the quality and configuration of this dataset more crucial than the framework itself? A deeper discussion on how the dataset is designed and its influence on model performance would strengthen the claims [Referring to A]. 2. The paper lacks experiments analyzing how the framework performs across different token counts and task types [Referring
The paper carefully considers the two important aspects of speculative decoding efficiency: the ability of the draft model to predict top-K tokens and the number of tokens the draft model can successfully predict. Authors identify respective shortcomings of the recent method EAGLE and propose practical solutions. - Empirical results are convincing and demonstrate meaningful improvements in terms of acceptance rate and speedups - Authors present extensive ablation studies providing additional in
A more rigorous mathematical presentation of the loss function in Section 3.2 (harmonized context alignment) would help to improve clarity. It is not easy to parse what is the input to the draft model at every step and what is it trying to predict. Writing out the objective function for training the draft model would help. I am also confused by Figure 3 - should Training Step 2 have $f^{(l)}_{t-2}$ and $f^{(l)}_{t-1}$ in the bottom row and only use superscript $(s_1)$ in the bottom right entry
1. The motivation to improve the current speculative decoding method is clear and well-founded. 2. The paper is well-written and easy to follow. 3. The proposed HASS significantly outperforms other baseline methods and achieves impressive speed-up performance. Without introducing extra inference costs, HASS can be nearly considered a 'free lunch' while incurring only acceptable additional training costs (3 or 4 extra alignment steps). 4. The authors conduct interesting ablation experiments to pr
1. The training efficiency of this method appears to have limitations. During the additional training steps (step 2, 3, 4...), each token’s corresponding KV matrix differs, making it infeasible to compute all attention scores through a single matrix multiplication. Could the authors offer a solution for this issue or provide detailed reports on the specific training time costs? 2. Section 3.2 (HARMONIZED CONTEXT ALIGNMENT) is the core of the proposed method, but the description is overly brief.
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Machine Learning and Algorithms · Face and Expression Recognition
Methods: LLaMA · Focus
