Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

Alina Shutova; Vladimir Malinovskii; Vage Egiazarian; Denis Kuznedelev; Denis Mazur; Nikita Surkov; Ivan Ermakov; Dan Alistarh

arXiv:2501.19392 · cs.LG · March 3, 2025



TL;DR

AQUA-KV is an adaptive quantization method for Key-Value caches in large language models. It exploits dependencies between keys and values across layers, together with high-compression mechanisms for internal network states, to compress the cache to 2-2.5 bits per value while keeping relative error in perplexity and LongBench scores under 1%.

Contribution

The paper presents AQUA-KV, a novel adaptive quantization technique that leverages key-value dependencies and internal state compression for efficient LLM caching.
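To illustrate why exploiting dependencies between cached tensors can help, here is a minimal sketch on synthetic data. It is not the paper's implementation: the linear least-squares predictor, the per-row min-max quantizer, the synthetic "keys", and all names (`quantize_uniform`, `prev_keys`, `next_keys`) are assumptions made for the example. The idea shown is only that when one tensor is largely predictable from another, quantizing the prediction residual loses less than quantizing the tensor directly at the same bit width.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Per-row min-max uniform quantization; returns the dequantized array."""
    levels = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    # Guard against constant rows, where hi == lo.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((x - lo) / scale)
    return codes * scale + lo

# Synthetic stand-ins (assumption): "next layer" keys are a linear
# function of "previous layer" keys plus a small amount of noise.
rng = np.random.default_rng(0)
n_tokens, dim = 512, 64
prev_keys = rng.standard_normal((n_tokens, dim))
mixing = rng.standard_normal((dim, dim)) / np.sqrt(dim)
next_keys = prev_keys @ mixing + 0.1 * rng.standard_normal((n_tokens, dim))

# Fit a compact linear predictor by least squares (hypothetical adapter).
W, *_ = np.linalg.lstsq(prev_keys, next_keys, rcond=None)
predicted = prev_keys @ W
residual = next_keys - predicted

# Compare 2-bit quantization of the raw tensor vs. of the residual.
direct = quantize_uniform(next_keys, bits=2)
via_residual = predicted + quantize_uniform(residual, bits=2)

direct_mse = float(np.mean((next_keys - direct) ** 2))
residual_mse = float(np.mean((next_keys - via_residual) ** 2))
print(f"direct MSE:   {direct_mse:.4f}")
print(f"residual MSE: {residual_mse:.4f}")
```

In this sketch the predictor sees the exact previous-layer keys. A real cache would only have their quantized reconstruction available at decode time, so the gap would be smaller in practice.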

Findings

01. Achieves near-lossless inference at 2-2.5 bits per value.

02. Maintains under 1% relative error in perplexity and LongBench scores.

03. Calibrates on a single GPU within 1-6 hours for 70B models.

Abstract

Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and…
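The "tens of gigabytes" figure can be checked with simple arithmetic. The sketch below assumes an illustrative 70B-class configuration with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and a 131,072-token context. These numbers and the helper name `kv_cache_bytes` are assumptions for the example, not values taken from the paper, and the estimate ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes (keys + values, no metadata)."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return n_values * bits_per_value / 8

# Illustrative configuration (assumption, see the note above).
config = dict(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=131072)

GIB = 2 ** 30
for bits in (16, 2.5, 2):
    size = kv_cache_bytes(bits_per_value=bits, **config)
    print(f"{bits:>4} bits/value: {size / GIB:.2f} GiB")
```

Under these assumptions, the 16-bit cache occupies 40 GiB for a single sequence, while 2 bits per value brings it down to 5 GiB.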


Code & Models

Repositories

goodevening13/aquakv (PyTorch, official)

Videos

Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models (SlidesLive)

Taxonomy

Topics: Topic Modeling

Methods: Pruning · LLaMA