PQCache: Product Quantization-based KVCache for Long Context LLM Inference

Hailin Zhang; Xiaodong Ji; Yilin Chen; Fangcheng Fu; Xupeng Miao; Xiaonan Nie; Weipeng Chen; Bin Cui

arXiv:2407.12820 · cs.CL · April 1, 2025

Open Access

TL;DR

PQCache leverages product quantization to efficiently manage key-value caches in large language models, significantly reducing memory bottlenecks and latency while maintaining model quality during long-context inference.

Contribution

The paper introduces PQCache, a novel method applying product quantization to KVCache in LLMs, balancing memory efficiency and inference quality.

Findings

01. Achieves 4.60% score improvement on InfiniteBench
02. Reduces serving latency during inference
03. Maintains model quality with low overhead

Abstract

As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), the intermediate representations of tokens within LLM inference, has now become the primary memory bottleneck due to limited GPU memory. Current methods selectively determine suitable keys and values for self-attention computation in LLMs to address the issue. However, they either fall short in maintaining model quality or result in high serving latency. Drawing inspiration from advanced embedding retrieval techniques prevalent in the data management community, we consider the storage and retrieval of KVCache as a typical embedding retrieval problem. We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency. During the prefilling phase, we apply PQ to tokens'…
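
The idea of treating KVCache lookup as embedding retrieval can be illustrated with a generic product quantization sketch. This is not the authors' implementation: the function names (`kmeans`, `pq_fit`, `pq_scores`, `pq_topk`), the parameters `m` (number of sub-spaces) and `k` (centroids per sub-space), and the choice to quantize key vectors and rank tokens by approximate inner product are all assumptions made for illustration. The sketch builds one codebook per sub-space from the cached keys, then scores a query against every key using only the compact codes and a small lookup table.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    assert k <= len(x), "need at least k points"
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):
                centroids[j] = members.mean(0)
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, d2.argmin(1)

def pq_fit(keys, m=4, k=16):
    """Split each key into m sub-vectors and cluster each sub-space.

    Returns codebooks of shape (m, k, d // m) and codes of shape (n, m).
    """
    n, d = keys.shape
    assert d % m == 0 and k <= 256
    sub = keys.reshape(n, m, d // m)
    codebooks = np.empty((m, k, d // m), dtype=keys.dtype)
    codes = np.empty((n, m), dtype=np.uint8)
    for i in range(m):
        codebooks[i], codes[:, i] = kmeans(sub[:, i, :], k, seed=i)
    return codebooks, codes

def pq_scores(query, codebooks, codes):
    """Approximate <query, key> for every key using codes only."""
    m, k, ds = codebooks.shape
    q_sub = query.reshape(m, ds)
    # table[i, j] = inner product of the i-th query sub-vector
    # with the j-th centroid of sub-space i.
    table = np.einsum("md,mkd->mk", q_sub, codebooks)
    # Look up one table entry per (token, sub-space) and sum over sub-spaces.
    return table[np.arange(m), codes].sum(axis=1)

def pq_topk(query, codebooks, codes, topk=8):
    """Indices of the tokens with the largest approximate scores."""
    return np.argsort(-pq_scores(query, codebooks, codes))[:topk]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((1024, 64))   # hypothetical cached keys
    query = rng.standard_normal(64)          # hypothetical decoding query
    codebooks, codes = pq_fit(keys, m=8, k=16)
    print(pq_topk(query, codebooks, codes, topk=8))
```

In this sketch each key is stored as `m` one-byte codes instead of `d` floats, and scoring a query costs one small table build plus `m` lookups per token. The selected indices would then be used to fetch the full key-value pairs for attention; how and where those pairs are stored is outside what this example covers.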

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling