QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

Rishabh Tiwari; Haocheng Xi; Aditya Tomar; Coleman Hooper; Sehoon Kim; Maxwell Horton; Mahyar Najibi; Michael W. Mahoney; Kurt Keutzer; Amir Gholami

arXiv:2502.10424·cs.LG·February 18, 2025

Open Access

TL;DR

QuantSpec is a self-speculative decoding framework in which the draft model shares the target model's architecture but uses a hierarchical 4-bit quantized KV cache and 4-bit quantized weights. It reports acceptance rates above 90% and end-to-end speedups of up to 2.5x in long-context LLM inference, while reducing memory usage.

Contribution

It proposes a self-speculative decoding method in which the draft model uses a hierarchical 4-bit quantized KV cache and 4-bit quantized weights, improving both decoding speed and memory efficiency.
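
As a rough illustration of what a hierarchical 4-bit cache could look like, the sketch below stores each value as an 8-bit code split into an upper and a lower 4-bit half, so a draft pass can read only the upper half while a verification pass reads both. This listing does not describe the actual layout, so the upper/lower split, the per-list min/max scaling, and every function name here are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: an ASSUMED upper/lower 4-bit split,
# not the paper's kernels or exact quantization scheme.

def quantize_hierarchical(values):
    """Quantize floats to 8-bit codes, stored as two separate 4-bit halves."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0          # fall back to 1.0 for constant input
    codes = [round((v - lo) / scale) for v in values]  # integers in 0..255
    upper = [c >> 4 for c in codes]         # coarse 4-bit part (0..15)
    lower = [c & 0xF for c in codes]        # 4-bit refinement (0..15)
    return upper, lower, scale, lo

def dequantize_draft(upper, scale, zero):
    """Draft path: reads only the upper 4 bits, so half the cache bytes."""
    # Adding 8 re-centres each 16-code-wide bucket.
    return [((u << 4) + 8) * scale + zero for u in upper]

def dequantize_target(upper, lower, scale, zero):
    """Verification path: reads both halves to recover the full 8-bit code."""
    return [((u << 4) | l) * scale + zero for u, l in zip(upper, lower)]

values = [-1.0, -0.37, 0.0, 0.42, 0.9, 2.0]
upper, lower, scale, zero = quantize_hierarchical(values)
draft_view = dequantize_draft(upper, scale, zero)
target_view = dequantize_target(upper, lower, scale, zero)
draft_err = max(abs(a - b) for a, b in zip(values, draft_view))
target_err = max(abs(a - b) for a, b in zip(values, target_view))
print(f"draft max error  {draft_err:.4f}")
print(f"target max error {target_err:.4f}")
```

Under these assumptions the two views share one stored copy of the data: the draft pays for half the memory traffic and accepts a coarser reconstruction, while the target sees the finer one.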

Findings

01 Achieves end-to-end speedups of up to 2.5x.

02 Maintains acceptance rates above 90%.

03 Reduces memory requirements by approximately 1.3x.

Abstract

Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups due to inefficient KV cache optimization strategies and result in low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates…
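
The draft-then-verify pattern that speculative decoding relies on can be sketched with toy next-token functions. This is a minimal greedy variant under stated assumptions: `toy_target` and `toy_draft` are hypothetical stand-ins for the full-precision and quantized passes of one model, `gamma` is an assumed draft length, and verification is written as a Python loop where a real system would use a single batched forward pass.

```python
# Minimal greedy speculative decoding sketch with hypothetical toy models.

def speculative_decode(target_next, draft_next, prompt, max_new_tokens, gamma=4):
    """Greedy draft-then-verify loop. Returns (tokens, fraction of drafts accepted)."""
    tokens = list(prompt)
    drafted = accepted = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft gamma tokens with the cheap model.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: keep the longest prefix the target agrees with.
        n_ok = 0
        for i, tok in enumerate(draft):
            if target_next(tokens + draft[:i]) != tok:
                break
            n_ok += 1
        drafted += gamma
        accepted += n_ok
        tokens += draft[:n_ok]
        # 3. The target always adds one token (a correction or a bonus token).
        tokens.append(target_next(tokens))
    rate = accepted / drafted if drafted else 0.0
    return tokens[:len(prompt) + max_new_tokens], rate

def toy_target(ctx):
    """Hypothetical 'full-precision' model: a deterministic function of the context."""
    return (31 * sum(ctx) + len(ctx)) % 50

def toy_draft(ctx):
    """Hypothetical 'quantized' model: disagrees with the target at every 5th position."""
    t = toy_target(ctx)
    return (t + 1) % 50 if len(ctx) % 5 == 0 else t

out, rate = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
print(out, rate)
```

Because every emitted token is either confirmed or produced by the target, the output matches what target-only greedy decoding would give. The draft's quality affects only how many tokens are accepted per round, which is why the acceptance rate matters for the speedup.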

Taxonomy

Topics: Algorithms and Data Compression · Cellular Automata and Applications · Error Correcting Code Techniques