Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation

Lujun Gui; Bin Xiao; Lei Su; Weipeng Chen

arXiv:2408.15562·cs.CL·August 29, 2024

TL;DR

FSPAD enhances lossless speculative decoding by sampling target features and applying partial alignment distillation, significantly improving speed and accuracy across various large language models and tasks.

Contribution

The paper introduces FSPAD, a novel method combining feature sampling and partial alignment distillation to boost lossless speculative decoding performance.

Findings

01 Outperforms state-of-the-art methods on multiple tasks

02 Effective across different LLM sizes and architectures

03 Improves decoding speed and accuracy

Abstract

Lossless speculative decoding accelerates target large language model (LLM) inference by employing a lightweight draft model for generating tree-structured candidates, which are subsequently verified in parallel by the target LLM. Currently, effective approaches leverage feature-level rather than token-level autoregression within the draft model to facilitate more straightforward predictions and enhanced knowledge distillation. In this paper, we reassess these approaches and propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding), which introduces two straightforward and effective components within the existing framework to boost lossless speculative decoding. Firstly, FSPAD utilizes token embeddings to sample features of the target LLM in high-dimensional space before feeding them into the draft model, due to the inherent uncertainty of the…
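The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not FSPAD itself: it assumes greedy decoding, a single chain of candidates instead of a tree, and a token-level draft rather than a feature-level one. The names `speculative_step`, `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are toy functions over integer token ids. The point it shows is the lossless property: whatever the draft proposes, the output matches plain greedy decoding with the target.

```python
from typing import Callable, List

# A "model" here is just a greedy next-token function over integer token ids.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    ctx = list(prefix)
    draft = []
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2) The target checks every drafted position. A real system does this in
    #    one parallel forward pass; the loop below only mimics the outcome.
    accepted = []
    ctx = list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3) Every draft token matched, so the target's next token comes for free.
    accepted.append(target_next(ctx))
    return accepted


def speculative_decode(prompt: List[int], draft_next: NextToken,
                       target_next: NextToken, max_new_tokens: int,
                       k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prompt) + max_new_tokens]


def greedy_decode(prompt: List[int], target_next: NextToken,
                  max_new_tokens: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[int]) -> int:
    # Hypothetical stand-in for the target LLM.
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical stand-in for the draft model: wrong at every third position.
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(prompt, toy_draft, toy_target, max_new_tokens=20)
    slow = greedy_decode(prompt, toy_target, max_new_tokens=20)
    print(fast == slow)
```

In this simplified setting the speedup depends only on how many drafted tokens the target accepts per step, which is why a better-aligned draft model translates directly into faster decoding.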

Taxonomy

Topics: Neural Networks and Applications · Advanced Data Compression Techniques · Algorithms and Data Compression