KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference

Sai Gokhale; Devleena Das; Rajeev Patwari; Ashish Sirasao; Elliott Delaye

arXiv:2512.01953·cs.LG·December 2, 2025

TL;DR

KV Pareto systematically explores and optimizes the trade-offs between memory usage and accuracy for long-context LLM inference by combining multiple KV cache and model compression techniques, enabling efficient deployment.

Contribution

Introduces KV Pareto, a framework that jointly optimizes KV cache and model compression techniques for long-context LLMs, achieving significant memory reduction with minimal accuracy loss.

Findings

01. Achieves 68-78% memory reduction with 1-3% accuracy loss.

02. Validates Pareto configurations across multiple benchmarks and extended context lengths.

03. Demonstrates the importance of joint optimization for practical long-context LLM inference.

Abstract

Long-context Large Language Models (LLMs) face significant memory bottlenecks during inference because the key-value (KV) cache grows linearly with sequence length. While individual optimization techniques such as KV cache quantization, chunked prefill, and model weight quantization have shown promise, their joint effects and optimal configurations for edge deployment remain underexplored. We introduce KV Pareto, a systems-level framework that systematically maps the trade-off frontier between total memory consumption and task accuracy across these three complementary optimization techniques. Our framework evaluates multiple LLM architectures (Qwen, Llama, Mistral) with varying KV quantization schemes (int2/4/8, mixed-precision), granularities (per-token, per-tensor, per-block), and 4-bit weight quantization via AWQ. Our framework identifies model-specific Pareto-optimal configurations…
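The two ideas the abstract leans on, linear KV cache growth and a memory/accuracy trade-off frontier, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Config` class, the `kv_cache_bytes` and `pareto_front` helpers, and all numeric values are hypothetical and chosen only for illustration. The size estimate uses the common "keys plus values, per layer, per KV head" accounting and ignores the quantization scale and zero-point overhead that a real per-token or per-block scheme would add.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One compression setting (hypothetical; not taken from the paper)."""
    name: str
    memory_gb: float  # total memory, lower is better
    accuracy: float   # task accuracy, higher is better


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits):
    """Estimate KV cache size for one sequence.

    The leading factor of 2 counts keys and values. The result grows
    linearly with seq_len and with the bit width of each cached element.
    Quantization metadata (scales, zero points) is ignored here.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bits // 8


def pareto_front(configs):
    """Return the configs that no other config dominates.

    A config is dominated if another one uses no more memory and reaches
    at least the same accuracy. After sorting by memory (ties broken by
    higher accuracy first), a config survives only if it beats the best
    accuracy seen so far.
    """
    front = []
    best_accuracy = float("-inf")
    for cfg in sorted(configs, key=lambda c: (c.memory_gb, -c.accuracy)):
        if cfg.accuracy > best_accuracy:
            front.append(cfg)
            best_accuracy = cfg.accuracy
    return front


if __name__ == "__main__":
    # Illustrative placeholder values only, not results from the paper.
    candidates = [
        Config("fp16-baseline", 10.0, 0.80),
        Config("int4-kv", 4.0, 0.78),
        Config("int8-kv", 5.0, 0.75),
        Config("int2-kv", 3.0, 0.70),
    ]
    for cfg in pareto_front(candidates):
        print(cfg.name, cfg.memory_gb, cfg.accuracy)
```

In this toy input, the `int8-kv` entry is dropped because `int4-kv` uses less memory and scores higher. The remaining entries form the frontier from which a deployment would pick a point according to its memory budget.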

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Speech Recognition and Synthesis