KV Cache Transform Coding for Compact Storage in LLM Inference

Konrad Staniszewski; Adrian Łańcucki

arXiv:2511.01815·cs.CL·March 12, 2026

Open Access · 3 Reviews

TL;DR

KVTC is a transform coding method that compresses key-value caches in large language models by up to 20×, enabling more memory-efficient inference while maintaining reasoning and long-context accuracy.

Contribution

Introduces KVTC, a lightweight, PCA-based transform coder that compresses KV caches for LLM inference, improving storage efficiency while maintaining model performance.

Findings

01

Achieves up to 20× compression of KV caches.

02

Maintains accuracy in reasoning and long-context tasks.

03

Outperforms existing inference-time compression methods.

Abstract

Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20× compression while maintaining reasoning and long-context accuracy, and 40× or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5…
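The three stages named in the abstract (decorrelating transform, quantization, entropy coding) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_pca`, `compress`, and `decompress` are hypothetical, a single uniform step size stands in for the paper's adaptive quantization, `zlib` (DEFLATE) stands in for its entropy coder, and the data is synthetic rather than a real KV cache.

```python
import zlib

import numpy as np


def fit_pca(calib):
    """Fit a PCA basis on calibration vectors of shape (n, d)."""
    mean = calib.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by
    # decreasing variance of the calibration data.
    _, _, vt = np.linalg.svd(calib - mean, full_matrices=False)
    return mean, vt


def compress(x, mean, basis, step):
    """Decorrelate with PCA, quantize uniformly, then DEFLATE the integers."""
    coeffs = (x - mean) @ basis.T
    # Assumption: one step size for all components. Low-variance
    # components mostly round to zero, which the entropy coder exploits.
    q = np.clip(np.round(coeffs / step), -32768, 32767).astype(np.int16)
    return zlib.compress(q.tobytes(), 9), q.shape


def decompress(blob, shape, mean, basis, step):
    """Invert the entropy coding, quantization, and PCA rotation."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int16).reshape(shape)
    return (q.astype(np.float32) * step) @ basis + mean


# Synthetic stand-in for cached vectors: 4 latent factors in 64 dimensions.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 64))
calib = rng.normal(size=(512, 4)) @ mixing + 0.01 * rng.normal(size=(512, 64))
x = rng.normal(size=(256, 4)) @ mixing + 0.01 * rng.normal(size=(256, 64))

mean, basis = fit_pca(calib)
blob, shape = compress(x, mean, basis, step=0.05)
x_hat = decompress(blob, shape, mean, basis, step=0.05)

ratio = x.astype(np.float16).nbytes / len(blob)
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(f"compression vs fp16: {ratio:.1f}x, relative error: {rel_err:.4f}")
```

The ratio printed here reflects the strongly low-rank synthetic data and says nothing about the figures reported in the paper. The basis is fitted once on calibration data and reused for new inputs, which mirrors the abstract's point that only a brief initial calibration is needed and model parameters stay unchanged.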

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

+ The paper presents an interesting approach by using a transform-coding framework for KV cache compression. Experiments demonstrate that this method can maintain model performance even with a high compression rate.
+ The ablation experiments are very thorough.

Weaknesses

+ The paper uses a limited number of evaluation benchmarks. Recent related works typically conduct comprehensive experiments on benchmarks like LongBench and RULER, whereas this paper only uses one dataset from each (Qasper and VT).
+ The "Related Work" section should also include content related to transform coding, as it is central to the proposed approach in this paper.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The KV cache is a key bottleneck for efficient LLM inference. The paper identifies and effectively tackles this real-world challenge.
2. KVTC uses well-established techniques (PCA, quantization, entropy coding) in a novel context. It requires no retraining and can be plugged into existing frameworks.
3. The authors test on diverse LLMs (Llama 3.1, Mistral-NeMo, R1-Distilled Qwen2.5) and datasets (MMLU, GSM8K, RULER, MATH500, etc.). The paper reports improvements in both latency and memory usa

Weaknesses

1. **Limited algorithmic novelty**: Transform coding with PCA+quantization is classical, and several SVD/quant methods exist (e.g., SVDq, xKV). The core components are classical in signal processing. The main contribution is an effective adaptation of known techniques, rather than a fundamentally new algorithm.
2. **Dependence on calibration data**: The PCA basis and bit allocation depend on a representative calibration dataset. When model structure changes, recalibration is required. There is

Reviewer 03 · Rating 6 · Confidence 4

Strengths

This paper is well written and easy to follow. The paper’s main strengths lie in its practicality, simplicity, and effectiveness. It introduces a lightweight, system-friendly transform-coding approach (kvtc) that achieves up to 20× compression of KV caches with minimal accuracy loss and no model modifications.

Weaknesses

The proposed method leverages the traditional transform-coding framework, achieving high compression ratios with minimal performance degradation. However, its limitation appears to be decompression latency. How does the decompression time compare with other methods listed in Table 2? In Table 4, the authors report only the compression and decompression times of the proposed approach, which seems insufficient for a fair comparison.

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Parallel Computing and Optimization Techniques