FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Guangda Liu; Chengwei Li; Zhenyu Ning; Jing Lin; Yiwu Yao; Danning Ke; Minyi Guo; Jieru Zhao

arXiv:2505.13109·cs.LG·March 10, 2026

Open Access · 3 Reviews

TL;DR

FreeKV is a training-free framework that improves KV cache retrieval efficiency for large language models, achieving up to 13× speedup while maintaining near-lossless accuracy across various models and scenarios.

Contribution

FreeKV introduces speculative retrieval and hybrid KV layouts, enabling efficient, accurate, and training-free KV cache management for LLM inference.

Findings

01

Achieves up to 13× speedup over state-of-the-art methods.

02

Maintains near-lossless accuracy across multiple models.

03

Effectively overlaps retrieval with computation for latency hiding.

Abstract

Large language models (LLMs) are widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods have been proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, a training-free algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory…
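The speculative retrieval idea summarized above can be illustrated with a small sketch. This is not the authors' implementation: it assumes, following the reviewers' description, that each head reuses the pages selected for the previous step when its current query is similar enough to the previous one, and re-selects otherwise. The names `select_pages`, `speculative_step`, `page_summaries`, and the threshold `tau` are hypothetical, and scoring pages by a dot product against one summary vector per page is a stand-in for whatever selection rule the paper actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_pages(query, page_summaries, budget):
    """Hypothetical selection rule: keep the `budget` pages with the highest
    dot-product score between the query and each page's summary vector."""
    scores = [sum(q * s for q, s in zip(query, summ)) for summ in page_summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def speculative_step(queries, prev_queries, prev_selection, page_summaries, budget, tau):
    """One decode step over all heads.

    A head reuses the pages selected at the previous step when its query is
    similar enough to the previous query (cosine >= tau, an assumed criterion);
    otherwise the head is corrected by re-selecting with the current query.
    Returns (pages used per head, indices of corrected heads).
    """
    used, corrected = [], []
    for h, (q, q_prev) in enumerate(zip(queries, prev_queries)):
        if cosine(q, q_prev) >= tau:
            used.append(prev_selection[h])  # speculative reuse, no selection on the critical path
        else:
            used.append(select_pages(q, page_summaries[h], budget))  # head-wise correction
            corrected.append(h)
    return used, corrected
```

In this sketch, the fraction of heads that end up in `corrected` for a given `tau` is the kind of correction-frequency statistic one could log when studying the accuracy and latency trade-off.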

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* The paper offers a clear algorithm–system co-design in which speculative reuse moves selection and recall off the critical path and head-wise correction restores accuracy only when needed.
* The hybrid HND/NHD layout and double-buffered streamed recall directly target PCIe fragmentation and enable effective overlap, which is a practical systems contribution likely to transfer to real serving stacks.
* The empirical study spans multiple models and benchmarks with detailed settings, and demonstr
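To make the fragmentation point about the HND/NHD layouts concrete, here is a minimal sketch that counts how many contiguous memory runs one page of one head occupies under each layout. It assumes NHD means a `[tokens, heads, head_dim]` array and HND means `[heads, tokens, head_dim]`, both stored row-major; the function names and parameters are hypothetical, and the sketch says nothing about which layout the paper places on the CPU or the GPU.

```python
def page_offsets(layout, head, start, page_size, num_tokens, num_heads, head_dim):
    """Flat element offsets of one head's page (tokens start..start+page_size-1)
    in a row-major buffer with the given layout (assumed definitions):
      NHD -> index order [token, head, dim]
      HND -> index order [head, token, dim]
    """
    offsets = []
    for n in range(start, start + page_size):
        for d in range(head_dim):
            if layout == "NHD":
                offsets.append((n * num_heads + head) * head_dim + d)
            elif layout == "HND":
                offsets.append((head * num_tokens + n) * head_dim + d)
            else:
                raise ValueError(f"unknown layout: {layout}")
    return offsets


def count_contiguous_runs(offsets):
    """Number of maximal runs of consecutive offsets, i.e. separate copies needed."""
    if not offsets:
        return 0
    offsets = sorted(offsets)
    runs = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + 1:
            runs += 1
    return runs
```

With more than one head, a page of `page_size` tokens splits into `page_size` separate runs in the NHD layout, while in the HND layout it is a single run, which is why per-head page transfers fragment in one layout and not in the other.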

Weaknesses

* LongBench v2 spans 8K to 2M tokens, yet the paper truncates inputs to 64K and caps generation at 16K, which leaves the very-long regime, where offloading dominates, underexplored; please add at least one ≥128K long-input case and one ≥32K long-generation case to validate scaling and stress the recall pipeline.
* Since accuracy relies on an LLM-as-judge (Qwen-3-32B), consider strengthening the evaluation with a larger or multi-judge setup and reporting inter-judge agreement to calibrate the scores.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Clear motivation and strong setup: The paper provides a well-motivated problem statement supported by preliminary empirical analysis.
2. Novel speculative retrieval mechanism: The proposed speculative retrieval with fine-grained correction is a conceptually novel idea that effectively breaks the strict dependency between KV selection and query scoring, enabling computation–I/O overlap.
3. Comprehensive and convincing experiments: The paper evaluates FreeKV across diverse models and tasks (e.

Weaknesses

1. Missing analysis of correction overheads: While the fine-grained correction mechanism is central to FreeKV’s efficiency–accuracy balance, the paper lacks a quantitative analysis of correction frequency and its impact on latency under different similarity thresholds. Such results would clarify the trade-off between performance and accuracy.
2. Limited study on KV budget sensitivity: The experiments fix the KV budget B but do not explore how varying B influences accuracy and throughput.

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. The paper tackles an important and practical problem in LLM serving: efficient KV retrieval under long contexts.
2. FreeKV presents well-motivated system–algorithm co-design with comprehensive experiments covering multiple models and tasks, demonstrating impressive empirical speedups and accuracy preservation.

Weaknesses

1. FreeKV improves runtime efficiency via engineering and overlap techniques. Its algorithmic novelty is incremental over prior work like InfiniGen and ArkVale.
2. The speculative reuse in FreeKV depends on strong query similarity assumptions that may not generalize to all model architectures or reasoning tasks.

Taxonomy

Topics: Natural Language Processing Techniques · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques