Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation
Lujun Gui, Bin Xiao, Lei Su, Weipeng Chen

TL;DR
FSPAD enhances lossless speculative decoding by sampling target features and applying partial alignment distillation, significantly improving speed and accuracy across various large language models and tasks.
Contribution
The paper introduces FSPAD, a novel method combining feature sampling and partial alignment distillation to boost lossless speculative decoding performance.
Findings
Outperforms state-of-the-art methods on multiple tasks
Effective across different LLM sizes and architectures
Improves decoding speed and accuracy
Abstract
Lossless speculative decoding accelerates target large language model (LLM) inference by employing a lightweight draft model for generating tree-structured candidates, which are subsequently verified in parallel by the target LLM. Currently, effective approaches leverage feature-level rather than token-level autoregression within the draft model to facilitate more straightforward predictions and enhanced knowledge distillation. In this paper, we reassess these approaches and propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding), which introduces two straightforward and effective components within the existing framework to boost lossless speculative decoding. Firstly, FSPAD utilizes token embeddings to sample features of the target LLM in high-dimensional space before feeding them into the draft model, due to the inherent uncertainty of the…
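The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not FSPAD itself: it assumes greedy decoding, a single candidate chain instead of a tree, and toy next-token functions standing in for the draft model and target LLM. All names (`speculative_step`, `generate`, `toy_target`, `toy_draft`) are hypothetical and chosen for illustration only.

```python
from typing import Callable, List

# A "model" here is just a function mapping a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Draft k tokens, then keep the longest run the target agrees with."""
    # 1. The lightweight draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target checks each position. A real system would score all
    #    positions in one parallel forward pass; this loop is sequential
    #    only to keep the sketch simple.
    accepted: List[int] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: emit the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + max_new]


def toy_target(ctx: List[int]) -> int:
    # Assumed toy target: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Assumed toy draft: agrees with the target except after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def target_only(prefix: List[int], max_new: int) -> List[int]:
    # Baseline: plain greedy decoding with the target alone.
    out = list(prefix)
    for _ in range(max_new):
        out.append(toy_target(out))
    return out


speculative_output = generate([0], toy_draft, toy_target, k=3, max_new=8)
baseline_output = target_only([0], max_new=8)
```

Under these assumptions, `speculative_output` matches `baseline_output` even though the draft is sometimes wrong, which is what "lossless" means in the greedy case: every emitted token is one the target itself would have chosen. The speedup comes from the fraction of draft tokens accepted per verification pass.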
Taxonomy
Topics: Neural Networks and Applications · Advanced Data Compression Techniques · Algorithms and Data Compression
