AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving
Shaoting Feng, Hanchen Li, Kuntai Du, Zhuohan Gu, Yuhan Liu, Jiayi Yao, Siddhant Ray, Samuel Shen, Yihua Cheng, Ganesh Ananthanarayanan, Junchen Jiang

TL;DR
AdaptCache introduces a lossy compression system for KV caches in LLM serving, significantly reducing delay and improving quality by optimizing cache storage and retrieval strategies.
Contribution
It proposes a dynamic, adaptive compression approach for KV caches that balances delay and quality, outperforming static methods in LLM applications.
Findings
Achieves 1.43--2.4x delay savings at same quality
Attains 6--55% quality improvements at same delay
Reduces loading delays significantly in large-scale LLM serving
Abstract
Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by storing the KV caches of processed context and loading the corresponding KV cache when a new request reuses the context. Further, as these LLM applications scale, the total size of KV caches becomes excessively large and requires both DRAM and SSD for full storage. However, prior work that stores KV caches in DRAM and SSD suffers from high loading delays, as most KV cache hits come from SSD, which is slow to load. To increase the KV cache hit rate on DRAM, we identify lossy KV cache compression as a promising approach. We design a lossy compression system that decides the compression algorithm, compression rate and device placement for each KV cache…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Data Storage Technologies
AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving
Shaoting Feng*∗,1* Hanchen Li*∗,1* Kuntai Du1 Zhuohan Gu1 Yuhan Liu1 Jiayi Yao1 Siddhant Ray1 Samuel Shen1 Yihua Cheng1 Ganesh Ananthanarayanan2 Junchen Jiang1
1University of Chicago 2Microsoft
Abstract.
Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by storing the KV caches of processed context and loading the corresponding KV cache when a new request reuses the context. Further, as these LLM applications scale, the total size of KV caches becomes excessively large and requires both DRAM and SSD for full storage.
However, prior work that stores KV caches in DRAM and SSD suffers from high loading delays, as most KV cache hits come from SSD, which is slow to load. To increase the KV cache hit rate on DRAM, we identify lossy KV cache compression as a promising approach. We design a lossy compression system that decides the compression algorithm, compression rate and device placement for each KV cache entry to maximise DRAM hits and minimise loading delay without significantly degrading generation quality. Compared to various static compression baselines across three tasks, our system AdaptCache achieves 1.43–2.4 delay savings at the same quality and 6–55% quality improvements at the same delay.
11footnotetext: Equal contribution
1. Introduction
With the accelerating adoption of Large Language Models (LLMs), inference systems must provide low-delay responses to meet growing user demands. To achieve this, many systems store and reuse the KV cache of previously processed contexts (e.g., chat history) to reduce redundant computation and speed up inference. However, due to the rapid growth of LLM applications, the size of KV caches also quickly increases and exceeds the storage capacity of both GPU and CPU memory. For example, even with two H100 nodes (each with 80GB GPU memory) and a total of 150GB CPU memory, models like Llama-3.1-70B-Instruct can only store the KV caches of 500K tokens, roughly corresponding to the chat histories of 30 users. This issue becomes more severe in agentic applications, where even a single run involves multiple agents that continuously exchange the text with each other, resulting in a context of millions of tokens and the corresponding KV cache size of 3TB. This underscores the need to fully leverage hierarchical storage within the datacenter to ensure high cache hit rates, reduce delay, and increase throughput.
There is a line of work (Chen et al., 2025; Gao et al., 2024) that explores storing KV caches in hierarchical storage (in this paper we focus on DRAM and SSD). However, we observe that these approaches still have high delay, because high-speed devices have limited storage space, which forces most of the KV caches to be stored to SSD, leading to slow KV cache loading time overall. To allow more cache hits on DRAM, we identify lossy KV cache compression (Liu et al., 2024; Xiao et al., 2023) as a promising approach: by lossily compressing the KV caches, we can store much more KV cache entries in CPU memory without significant degredation of generation quality111Generation quality is defined as the similarity between the generated answer after compression and the original prefill answer, measured by task-specific metrics, i.e., F1, ROUGE-L (Lin, 2004) or CodeBLEU (Ren et al., 2020)..
As a result, our goal is to design a lossy KV cache storage system on a hierarchical storage (DRAM and SSD) that ensures low inference latency with minimum quality degredation.
This paper argues that, instead of applying the same compression (i.e., using the same compression algorithm and compression rate across devices) to all KV caches, we need to adaptively adjust the compression algorithm, compression rate and device placement to achieve optimal KV cache compression. Concretely:
Compression algorithm needs to be aware of the context - Some texts are only important at the beginning and the end (which is suitable for token dropping), but some texts aim to provide new information for the LLM (where quantization might be better).
Compression ratio and device placement need to be aware of the content and the reuse frequency. Some KV caches are frequently reused and easy-to-be-compressed and they need to be stored in CPU memory with a high compression ratio, but some KV caches are not reused frequently and difficult to compress, so it may be suitable to place them on disk with a low compression ratio.
Figure 1 shows that, if we optimally adapt the compression algorithm, rate and device, the potential of improvement is huge: we can achieve much higher KV cache hit rate on CPU and much lower average loading time from disk, while still having comparable accuracy as applying same compression across devices.
To address these challenges, we propose AdaptCache, the first hierarchical KV cache storage system leveraging lossy KV cache compression. The core of AdaptCache is a utility metric that allows us to quantify the effect of storing KV caches in terms of both quality and loading delay. Then, we use marginal utility gain to evaluate the effect of each design decisions – compression algorithm, compression rate, cache placement and eviction – and pick the design decision with the highest marginal utility gain. Acknowledging that this design is greedy and not necessarily optimal, we argue that it suffices to deliver significant delay reduction and accuracy improvement in practice, and optimal solution is not trackable since it is NP-hard.
To evaluate AdaptCache, we use the popular Llama-3.1-8B-Instruct model and three LongBench datasets. Compared to the strongest fixed compression baseline, AdaptCache is 1.43–2.4 faster at the same quality and improves quality by 6–55% under the same delay.
2. AdaptCache Design
AdaptCache has three main components: an estimator, a policy optimizer, and an executor. The estimator does offline profiling to estimate the quality–delay curve for each entry, compression method and storage device. The policy optimizer makes compression decisions for incoming and existing KV cache entries. The executor implements the compression decisions.
Estimator: Our offline profiler uses dummy questions to measure the transfer delays of each storage device and decompression overhead. It also samples ten entries from each dataset, using questions generated by GPT 4o, to generate the quality–compression rate curve for each compression method. We estimate the future cache hit frequency of one KV cache entry using its historical hit frequency.
Utility definition: The objective of AdaptCache is to minimize average system delay while maintaining high quality. To capture this trade-off, we compute a quality-delay weighted sum for each KV cache entry. Since some entries are reused more frequently than others, storing these in limited storage can increase the overall hit rate thus reducing delay. Therefore, we define each entry’s utility as the product of its estimated future occurrence frequency and its quality-delay weighted sum:
[TABLE]
where represents the i-th KV cache entry, and are the compression method and rate used.
The objective of AdaptCache can be expressed as to maximise the total utility across all KV cache entries. Mathematically, this is an NP-hard Multi-Choice Knapsack Problem (MCKP). We adopt a greedy approach based on the textbook solution (Kellerer et al., 2004) that achieves Linear Programming Optimality. When a new KV cache entry is generated, we compute the following metric for both the new entry and all entries in storage. We define
[TABLE]
as the marginal utility drop, i.e., the utility drop per unit of space saved if we compress furthermore or evict the entry. and are different compression rates.
For each storage tier, we will greedily decide the compression choice (to evict, store in full quality or compress each entry) that results in the minimal marginal utility drop.
3. Preliminary Results
Evaluation setup: We use 1,100 contexts from six LongBench datasets (Bai et al., 2023; Gliwa et al., 2019; Zhong et al., 2021; Joshi et al., 2017; Yang et al., 2018; Liu et al., 2023; Guo et al., 2023) to evaluate AdaptCache, covering three task types, summarization, question answering (QA) and coding. These datasets do not contain the timestamps for request arrival, thus we follow previous works to generate request arritval timestamps using Poisson distribution with different request rates (Kwon et al., 2023; Xue et al., 2020; Gao et al., 2024). Experiments are run on one NVIDIA A100 GPU (100 GB DRAM, 400 GB SSD) using Llama-3.1-8B-Instruct model. The disk reading throughput is 1 GB/s. The baselines we compare are:
Without Compression denotes offloading KV cache to DRAM and SSD without compression.
KIVI LRU / StreamingLLM LRU use KIVI (Liu et al., 2024) and StreamingLLM (Xiao et al., 2023) with fixed compression rates, and offload KV cache to DRAM and SSD using a Least Recently Used (LRU) policy.
Prefill denotes prefilling (recomputation).
Overall results: As shown in Figure 2, across three tasks, AdaptCache consistently reduces time to first token (TTFT) compared to all baselines. Compared to naive prefill and offloading, AdaptCache reduces TTFT by 56%, achieving within 15% quality drop. Compared to KIVI, AdaptCache reduces TTFT by 69% at the same quality. Finally, compared to StreamingLLM, AdaptCache greatly improves quality by 15–89% at the same TTFT.
Understanding AdaptCache’s improvements: AdaptCache outperforms approaches that use no compression or fixed compression rates by detecting speed differences between devices and aggressively compressing the KV cache. This strategy significantly increases the hit rate on high-speed devices, leading to much lower loading delay and reduced TTFT. For example, in the coding task, KIVI LRU achieves a DRAM cache hit rate of 38% when quantized to 2 bits. In contrast, our method achieves hit rates of 81%, 56%, 44%, and 11%, for different weights between delay and quality. AdaptCache also prioritises compressing and storing KV cache from longer contexts or those with high information redundancy, and selects optimal compression algorithms for each entry, thereby ensuring overall high response quality.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li (2023) Long Bench: a bilingual, multitask benchmark for long context understanding . ar Xiv preprint ar Xiv:2308.14508 . Cited by: §3 .
- 2[2] W. Chen, S. He, H. Qu, R. Zhang, S. Yang, P. Chen, Y. Zheng, B. Huai, and G. Chen (2025) { \{ impress } \} : An { \{ importance-informed } \} { \{ multi-tier } \} prefix { \{ kv } \} storage system for large language model inference . In 23rd USENIX Conference on File and Storage Technologies (FAST 25) , pp. 187–201 . Cited by: §1 .
- 3[3] B. Gao, Z. He, P. Sharma, Q. Kang, D. Jevdjic, J. Deng, X. Yang, Z. Yu, and P. Zuo (2024) { \{ cost-Efficient } \} large language model serving for multi-turn conversations with { \{ cachedattention } \} . In 2024 USENIX Annual Technical Conference (USENIX ATC 24) , pp. 111–126 . Cited by: §1 , §3 .
- 4[4] B. Gliwa, I. Mochol, M. Biesek, and A. Wawer (2019) SAM Sum corpus: a human-annotated dialogue dataset for abstractive summarization . ar Xiv preprint ar Xiv:1911.12237 . Cited by: §3 .
- 5[5] D. Guo, C. Xu, N. Duan, J. Yin, and J. Mc Auley (2023) Longcoder: a long-range pre-trained language model for code completion . In International Conference on Machine Learning , pp. 12098–12107 . Cited by: §3 .
- 6[6] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) Trivia QA: a large scale distantly supervised challenge dataset for reading comprehension . External Links: 1705.03551 Cited by: §3 .
- 7[7] H. Kellerer, U. Pferschy, and D. Pisinger (2004) Multidimensional knapsack problems . In Knapsack problems , pp. 235–283 . Cited by: §2 .
- 8[8] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention . In Proceedings of the 29th Symposium on Operating Systems Principles , pp. 611–626 . Cited by: §3 .
