TL;DR
CommVQ introduces a commutative vector quantization method that compresses the KV cache of large language models, enabling longer context processing with minimal accuracy loss and a smaller memory footprint.
Contribution
The paper proposes an additive vector quantization scheme with a lightweight encoder and a codebook designed to commute with Rotary Position Embedding (RoPE), so that decoding can be folded into self-attention, giving high compression at low computational overhead.
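As a rough illustration of what "decoded via simple matrix multiplication" can mean for additive quantization, the sketch below reconstructs a vector as the sum of the codebook rows selected by a binary code. The toy codebook, the code length, and the `decode` helper are hypothetical and are not taken from the paper; the learned encoder that would produce the code is omitted.

```python
# Minimal sketch of additive-quantization decoding (assumed form, toy sizes).
# A vector is stored as a short binary code; decoding sums the selected
# codebook rows, which is the same as the product code @ codebook.

def decode(code, codebook):
    """Return the sum of codebook rows whose code bit is 1."""
    dim = len(codebook[0])
    out = [0.0] * dim
    for bit, row in zip(code, codebook):
        if bit:
            for j in range(dim):
                out[j] += row[j]
    return out

# Hypothetical codebook: 3 rows, each a 4-dimensional vector.
codebook = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 2.0, 0.0, -1.0],
    [0.5, 0.5, 0.5, 0.5],
]
code = [1, 0, 1]  # 3 bits stored instead of 4 FP16 values

x_hat = decode(code, codebook)

# The same result written explicitly as a vector-matrix product.
x_hat_matmul = [
    sum(code[i] * codebook[i][j] for i in range(len(code)))
    for j in range(len(codebook[0]))
]
print(x_hat, x_hat_matmul)
```

In this toy setting the cache would hold only the bits of `code` per vector, while the codebook is shared across all vectors.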
Findings
Reduces FP16 KV cache size by 87.5% with 2-bit quantization.
Enables 1-bit KV cache quantization with minimal accuracy loss.
Allows LLaMA-3.1 8B to process 128K context length on a single GPU.
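The 87.5% figure follows from keeping 2 of every 16 bits. The back-of-the-envelope sketch below works through that arithmetic for a 128K-token context. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption based on the publicly listed LLaMA-3.1 8B configuration, not a number from this page, and the estimate ignores codebook storage, model weights, and activations.

```python
# Rough KV cache size estimate (assumed model shape; codebook overhead ignored).

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Bytes needed to store keys and values for `tokens` positions."""
    values = 2 * layers * kv_heads * head_dim * tokens  # factor 2: keys + values
    return values * bits_per_value / 8

GIB = 1024 ** 3

# Assumed LLaMA-3.1 8B shape and a 128K-token context.
cfg = dict(tokens=128 * 1024, layers=32, kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(bits_per_value=16, **cfg)
two_bit = kv_cache_bytes(bits_per_value=2, **cfg)
one_bit = kv_cache_bytes(bits_per_value=1, **cfg)

reduction = 1 - two_bit / fp16

print(f"FP16:  {fp16 / GIB:.1f} GiB")
print(f"2-bit: {two_bit / GIB:.1f} GiB ({reduction:.1%} smaller)")
print(f"1-bit: {one_bit / GIB:.1f} GiB")
```

Under these assumptions the FP16 cache alone would be on the order of the memory of a single consumer GPU, which is consistent with the motivation for low-bit quantization stated above.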
Abstract
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on…
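To make "commutative with RoPE" concrete: RoPE rotates each pair of dimensions by a position-dependent angle, and any 2x2 block of the form [[a, -b], [b, a]] commutes with every 2D rotation. The sketch below checks this numerically and contrasts it with a generic 2x2 matrix. The block form, helper names, and constants are illustrative assumptions; how the paper constrains and trains its codebook is not detailed on this page.

```python
import math

# Sketch: a 2x2 block [[a, -b], [b, a]] commutes with any 2D rotation,
# while a generic 2x2 matrix does not. Names and values are hypothetical.

def matmul2(A, B):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]

def rotation(theta):
    """2D rotation matrix, the building block RoPE applies to each dimension pair."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def commutative_block(a, b):
    """Scaled-rotation form that commutes with every 2D rotation."""
    return [[a, -b], [b, a]]

def max_abs_diff(A, B):
    return max(abs(A[i][j] - B[i][j]) for i in range(2) for j in range(2))

R = rotation(0.7)                   # rotation for some position/frequency
C = commutative_block(1.3, -0.4)    # structured codebook block
G = [[1.0, 2.0], [3.0, 4.0]]        # unstructured block for comparison

structured_gap = max_abs_diff(matmul2(R, C), matmul2(C, R))
generic_gap = max_abs_diff(matmul2(R, G), matmul2(G, R))

print(f"R*C vs C*R: {structured_gap:.2e}")
print(f"R*G vs G*R: {generic_gap:.2e}")
```

When the blocks commute, a rotation applied after decoding can be reordered with the codebook product, which is the kind of algebraic freedom that would let the rotation be handled once rather than per decoded key.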
