CommVQ: Commutative Vector Quantization for KV Cache Compression

Junyan Li; Yang Zhang; Muhammad Yusuf Hassan; Talha Chafekar; Tianle Cai; Zhile Ren; Pengsheng Guo; Foroozan Karimzadeh; Colorado Reed; Chong Wang; Chuang Gan

arXiv:2506.18879·cs.CL·June 24, 2025

TL;DR

CommVQ is a commutative vector quantization method that compresses the KV cache of large language models down to 2-bit or even 1-bit quantization, enabling longer-context inference with minimal accuracy loss.

Contribution

The paper proposes additive quantization with a lightweight encoder and codebook to compress the KV cache, and designs the codebook to commute with Rotary Position Embedding (RoPE) so that decoding can be integrated into self-attention with low overhead.
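As a rough illustration of why an additive code can be decoded with a matrix multiplication, the sketch below reconstructs a vector as the sum of the codebook rows selected by a binary code. The names `decode`, `codebook`, and `code_bits` are hypothetical, the codebook values are made up, and the learned encoder is not shown; this is a toy under those assumptions, not the repository's implementation.

```python
def decode(code_bits, codebook):
    """Reconstruct a vector as code_bits @ codebook (sum of the selected rows)."""
    dim = len(codebook[0])
    return [
        sum(bit * row[j] for bit, row in zip(code_bits, codebook))
        for j in range(dim)
    ]


# Toy codebook (assumed values): 3 codewords of dimension 2.
codebook = [
    [1.0, 0.0],
    [0.0, 2.0],
    [0.5, 0.5],
]

# A 3-bit code selects codewords 0 and 2; only these bits need to be cached.
code_bits = [1, 0, 1]

decoded = decode(code_bits, codebook)
print(decoded)
```

Only the bits are stored per token, while the codebook is shared, which is where the memory saving comes from in this toy setting.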

Findings

01. Reduces FP16 KV cache size by 87.5% with 2-bit quantization.

02. Enables 1-bit KV cache quantization with minimal accuracy loss.

03. Allows LLaMA-3.1 8B to process 128K context length on a single GPU.
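The 87.5% figure follows from the bit widths alone: 2 bits is one eighth of 16 bits. The sketch below works through that arithmetic and a back-of-the-envelope cache size at 128K tokens. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption about LLaMA-3.1 8B that is not stated on this page, and the estimate ignores codebook storage and any other overhead.

```python
def reduction_vs_fp16(bits_per_value):
    """Fraction of FP16 KV cache memory saved at the given bit width."""
    return 1 - bits_per_value / 16


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Bytes for keys and values across all layers (no overhead counted)."""
    values = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return values * bits_per_value // 8


GIB = 2**30
TOKENS = 128 * 1024  # 128K context

# Assumed LLaMA-3.1 8B shape: 32 layers, 8 KV heads, head_dim 128.
fp16_gib = kv_cache_bytes(TOKENS, 32, 8, 128, 16) / GIB
two_bit_gib = kv_cache_bytes(TOKENS, 32, 8, 128, 2) / GIB
one_bit_gib = kv_cache_bytes(TOKENS, 32, 8, 128, 1) / GIB

print(reduction_vs_fp16(2))  # 0.875
print(fp16_gib, two_bit_gib, one_bit_gib)  # 16.0 2.0 1.0
```

Under these assumptions, the FP16 cache alone would take about 16 GiB at 128K tokens, compared with about 2 GiB at 2 bits and 1 GiB at 1 bit.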

Abstract

Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on…
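To make the commutativity idea concrete: RoPE rotates pairs of channels by a position-dependent angle, and any 2x2 block of the form [[a, -b], [b, a]] commutes with a 2D rotation. The sketch below checks this numerically for one angle. The abstract does not spell out the codebook parameterization, so this block form is an illustrative assumption, and all names and values are hypothetical.

```python
import math


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def rotation(theta):
    """2D rotation matrix, as applied by RoPE to one pair of channels."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def close(a, b, tol=1e-12):
    """Element-wise comparison of two 2x2 matrices within a tolerance."""
    return all(abs(a[i][j] - b[i][j]) <= tol for i in range(2) for j in range(2))


R = rotation(0.7)

# Structured block [[a, -b], [b, a]] (assumed form): commutes with R.
structured = [[0.3, -1.2], [1.2, 0.3]]
# Unconstrained block: generally does not commute with R.
generic = [[0.3, -1.2], [0.4, 0.9]]

structured_commutes = close(matmul2(R, structured), matmul2(structured, R))
generic_commutes = close(matmul2(R, generic), matmul2(generic, R))

print(structured_commutes, generic_commutes)  # True False
```

When the blocks commute, the rotation can be moved to the other side of the codebook product, which is what allows position encoding and decoding to be reordered inside the attention computation in this simplified picture.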

Code & Models

Repositories

umass-embodied-agi/commvq
PyTorch (official)