Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Yudi Zhang; Weilin Zhao; Xu Han; Tiejun Zhao; Wang Xu; Hailong Cao; Conghui Zhu

arXiv:2505.22179·cs.CL·May 30, 2025

TL;DR

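This paper studies how speculative decoding and quantization combine for large language model inference. It finds that the memory benefit of 4-bit weight quantization is eroded by the extra computation of speculative decoding, and proposes a hierarchical framework that reaches a 2.78× speedup on Llama-3-70B.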

Contribution

The paper identifies the limitations of applying speculative decoding to quantized models and introduces a hierarchical framework that increases the speedup of large language model inference.

Findings

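01. The hierarchical framework achieves a 2.78× speedup on Llama-3-70B.

02. Applying EAGLE-2 to 4-bit weight-quantized models diminishes the memory benefit of quantization: verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass.

03. The hierarchical approach outperforms EAGLE-2 by 1.31×.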

Abstract

Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit…
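
The trade-off the abstract describes can be illustrated with a rough cost model. The sketch below is not taken from the paper: the function name, its parameters, and every number in it are hypothetical, chosen only to show how a more expensive verification pass reduces the speedup of speculative decoding. Costs are measured in units of one single-token forward pass of the target model.

```python
def speculative_speedup(accepted_per_cycle: float,
                        draft_cost: float,
                        verify_cost: float) -> float:
    """Estimate the speedup of speculative decoding over plain decoding.

    Assumed model (illustrative, not the paper's): plain autoregressive
    decoding emits 1 token per unit of time, while one speculative cycle
    emits `accepted_per_cycle` tokens and takes `draft_cost + verify_cost`
    units of time.
    """
    if accepted_per_cycle <= 0 or draft_cost < 0 or verify_cost <= 0:
        raise ValueError("expected positive token count and verify cost, "
                         "and a non-negative draft cost")
    return accepted_per_cycle / (draft_cost + verify_cost)


# Hypothetical case A: verifying the draft tree costs about the same as a
# single-token forward pass (the memory-bound regime).
memory_bound = speculative_speedup(accepted_per_cycle=4.0,
                                   draft_cost=0.5,
                                   verify_cost=1.1)

# Hypothetical case B: same acceptance and drafting cost, but verifying the
# draft tree costs twice a single-token forward pass.
costly_verify = speculative_speedup(accepted_per_cycle=4.0,
                                    draft_cost=0.5,
                                    verify_cost=2.0)

print(f"case A speedup: {memory_bound:.2f}x")
print(f"case B speedup: {costly_verify:.2f}x")
```

With the number of accepted tokens held fixed, the estimated speedup falls as the verification cost grows relative to a single-token pass, which is the direction of the effect the abstract reports for 4-bit weight-quantized models.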

Code & Models

Repositories

ai9stars/specmquant
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Speech Recognition and Synthesis