Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi

TL;DR
This paper introduces an adaptive KV-cache quantization method for on-device LLMs that dynamically assigns bit-widths based on token importance, reducing memory and latency while maintaining high accuracy.
Contribution
It proposes a learned policy that adaptively allocates KV precision during decoding, outperforming static and heuristic quantization schemes.
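To make the idea of importance-dependent precision concrete, here is a minimal sketch in Python. It is not the paper's learned policy: the fixed thresholds, the bit-width choices (2, 4, 8), the importance scores, and all function names are assumptions chosen for illustration. The only thing it shows is how a per-token score could select a bit-width before a symmetric uniform quantizer is applied to that token's cached vector.

```python
from bisect import bisect_right

# Hypothetical settings, not taken from the paper.
BIT_CHOICES = (2, 4, 8)      # candidate bit-widths, lowest to highest
THRESHOLDS = (0.25, 0.75)    # importance cut points separating the choices


def assign_bits(importance):
    """Map each importance score (assumed to lie in [0, 1]) to a bit-width."""
    return [BIT_CHOICES[bisect_right(THRESHOLDS, s)] for s in importance]


def fake_quantize(vec, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantization."""
    qmax = 2 ** (bits - 1) - 1          # largest representable integer level
    max_abs = max(abs(v) for v in vec)
    if max_abs == 0.0:
        return list(vec)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in vec]


def adaptive_quantize(kv_rows, importance):
    """Quantize one cached vector per token at its assigned bit-width."""
    bits = assign_bits(importance)
    dequantized = [fake_quantize(row, b) for row, b in zip(kv_rows, bits)]
    return dequantized, bits


def average_bits(bits):
    """Mean bits per cached value across tokens."""
    return sum(bits) / len(bits)


if __name__ == "__main__":
    rows = [[0.3, -1.0, 0.55, 0.8]] * 3   # toy cached vectors, one per token
    scores = [0.1, 0.5, 0.9]              # toy importance scores
    _, bits = adaptive_quantize(rows, scores)
    print(bits, average_bits(bits))
```

In this toy setup the low-importance token is stored at 2 bits and the high-importance token at 8 bits, so the average cost falls below a uniform 8-bit cache while the most informative token keeps the finest resolution. A learned policy, as the paper describes, would replace the fixed thresholds in `assign_bits`.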
Findings
Reduces decoding latency by up to 17.75% on HellaSwag.
Improves accuracy by up to 7.60 points over static quantization.
Maintains accuracy within 0.30 points of FP16 inference.
Abstract
Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our…
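The linear growth of the KV cache with context length can be illustrated with a short back-of-the-envelope calculation in Python. The model dimensions below (32 layers, 8 KV heads, head dimension 128) are hypothetical and do not come from the paper, and the estimate ignores quantization metadata such as scales and zero-points, which add some overhead in practice.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bits_per_value=16):
    """Estimate KV-cache size in bytes for one sequence.

    The cache holds one key and one value vector (hence the factor 2)
    per token, per layer, per KV head.
    """
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return num_values * bits_per_value // 8


if __name__ == "__main__":
    # Hypothetical model configuration, chosen only for illustration.
    cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    for seq_len in (1024, 4096):
        for bits in (16, 4):
            mib = kv_cache_bytes(seq_len, bits_per_value=bits, **cfg) / 2**20
            print(f"seq_len={seq_len:5d} bits={bits:2d} -> {mib:.0f} MiB")
```

Under these assumptions a 4096-token context occupies 512 MiB at 16 bits per value and 128 MiB at 4 bits, and the size scales in direct proportion to the number of cached tokens.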
