Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh

TL;DR
AQUA-KV introduces an adaptive quantization method for Key-Value caches in large language models, significantly reducing memory usage while maintaining high accuracy by exploiting key-value dependencies and internal state compression.
Contribution
The paper presents AQUA-KV, a novel adaptive quantization technique that leverages key-value dependencies and internal state compression for efficient LLM caching.
Findings
Achieves near-lossless inference at 2-2.5 bits per value.
Maintains under 1% relative error in perplexity and LongBench scores.
Calibrates on a single GPU within 1-6 hours for 70B models.
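To put the 2-2.5 bits per value figure in context, the following back-of-the-envelope sketch estimates the size of a Key-Value cache at different bit widths. The model shape (80 layers, 8 key-value heads, head dimension 128) and the 131072-token context are illustrative assumptions, not numbers taken from the paper, and the helper `kv_cache_bytes` is a hypothetical name. The estimate ignores quantizer metadata such as scales and any adapter weights.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV cache size in bytes for one sequence.

    Each token stores one key vector and one value vector per layer,
    hence the factor of 2. Quantizer metadata is not counted.
    """
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * seq_len * bits_per_value / 8


# Assumed (hypothetical) model shape and context length.
config = dict(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=131072)

for bits in (16, 2.5, 2):
    gib = kv_cache_bytes(bits_per_value=bits, **config) / 2**30
    print(f"{bits:>4} bits/value: {gib:.2f} GiB")
```

Under these assumed shapes the cache shrinks from 40 GiB at 16 bits per value to 6.25 GiB at 2.5 bits and 5 GiB at 2 bits.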
Abstract
Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and…
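The idea of exploiting dependencies between layers can be illustrated with a toy example: if one layer's keys are largely predictable from the previous layer's keys, only the prediction residual needs to be quantized, and the residual has a much smaller range. The sketch below is not the paper's implementation. The synthetic data, the least-squares linear predictor, and the per-tensor uniform quantizer `uniform_quantize` are all assumptions chosen for brevity, and the predictor is given the exact previous-layer keys rather than their quantized reconstruction.

```python
import numpy as np


def uniform_quantize(x, bits):
    """Round-trip x through a per-tensor uniform quantizer with 2**bits levels."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale)  # integer codes in [0, levels]
    return codes * scale + lo           # dequantized values


# Synthetic stand-in for two adjacent layers' keys (assumption: the second
# layer is a linear function of the first plus a small amount of noise).
rng = np.random.default_rng(0)
n_tokens, dim = 2048, 64
prev_keys = rng.standard_normal((n_tokens, dim))
mixing = rng.standard_normal((dim, dim)) / np.sqrt(dim)
next_keys = prev_keys @ mixing + 0.1 * rng.standard_normal((n_tokens, dim))

# Fit the linear predictor on a calibration split, evaluate on the rest.
calib, held_out = slice(0, 1024), slice(1024, None)
W, *_ = np.linalg.lstsq(prev_keys[calib], next_keys[calib], rcond=None)

target = next_keys[held_out]
prediction = prev_keys[held_out] @ W

# Baseline: quantize the keys directly at 2 bits.
direct = uniform_quantize(target, bits=2)

# Predictor + residual: quantize only what the predictor cannot explain.
reconstructed = prediction + uniform_quantize(target - prediction, bits=2)

err_direct = float(np.mean((target - direct) ** 2))
err_residual = float(np.mean((target - reconstructed) ** 2))
print(f"direct 2-bit MSE:   {err_direct:.4f}")
print(f"residual 2-bit MSE: {err_residual:.4f}")
```

On this synthetic data the residual path gives a lower reconstruction error at the same bit width, because the quantization grid covers a much narrower range. How well this carries over to real caches depends on how predictable the actual keys and values are.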
Taxonomy
Topics: Topic Modeling
Methods: Pruning · LLaMA
