On 10x Better Scalability: KV Stores Scale Up KV Cache
Weiping Yu, Ye Jiarui, He Mengke, Junfeng Liu, Siqiang Luo

TL;DR
This paper introduces SGLANG-LSM, a scalable KV cache system for large language models that leverages LSM-tree architecture to significantly improve cache hit rates and reduce latency.
Contribution
It applies a database-inspired LSM-tree architecture to KV cache management for LLMs, addressing the scalability bottlenecks of existing disk-based systems.
Findings
Up to 143% increase in cache hits
Up to 24% reduction in time-to-first-token
First systematic application of database storage architectures to LLM cache
Abstract
Large language models (LLMs) rely on Key-Value (KV) cache to reduce time-to-first-token (TTFT) latency, but existing disk-based KV cache systems using file-per-object layouts suffer from severe scalability bottlenecks due to file system metadata overhead, I/O inefficiency, and poor spatial locality. This paper presents SGLANG-LSM, a database-inspired system that leverages Log-Structured Merge-tree (LSM-tree) architectures for scalable KV cache management. SGLANG-LSM implements a layered system design with three coordinated components: (1) a prefix-preserving storage engine that maintains token sequence locality while efficiently storing large KV cache tensors through key-value separation, (2) an adaptive controller that dynamically optimizes LSM-tree configurations based on shifting workload characteristics, and (3) runtime services including batch operations and automatic resource…
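The abstract's first component (a prefix-preserving key layout combined with key-value separation) can be illustrated with a small toy. This is only a sketch under stated assumptions, not the paper's implementation: the fixed-width big-endian token encoding, the class name TinyPrefixStore, and the in-memory sorted list standing in for an LSM-tree index are all hypothetical choices made for illustration. It shows that when a token prefix maps to a byte prefix, sequences sharing a prefix sit next to each other in sorted key order, while the large payloads live in a separate append-only log that the index only points into.

```python
import bisect
import struct


def encode_key(tokens):
    """Hypothetical encoding: 4-byte big-endian per token.

    Fixed width means a token prefix is always a byte prefix, so
    lexicographic byte order keeps related sequences adjacent.
    """
    return b"".join(struct.pack(">I", t) for t in tokens)


class TinyPrefixStore:
    """Toy stand-in for a sorted index with key-value separation."""

    def __init__(self):
        self.keys = []                 # sorted encoded keys (the "index")
        self.pointers = {}             # key -> (offset, length) in value_log
        self.value_log = bytearray()   # append-only payload storage

    def put(self, tokens, payload):
        key = encode_key(tokens)
        if key not in self.pointers:
            bisect.insort(self.keys, key)
        # Only a small pointer is kept with the key; the payload is appended.
        self.pointers[key] = (len(self.value_log), len(payload))
        self.value_log += payload

    def get(self, tokens):
        ptr = self.pointers.get(encode_key(tokens))
        if ptr is None:
            return None
        offset, length = ptr
        return bytes(self.value_log[offset:offset + length])

    def longest_prefix(self, tokens):
        """Number of leading tokens covered by the longest stored prefix."""
        for n in range(len(tokens), 0, -1):
            if encode_key(tokens[:n]) in self.pointers:
                return n
        return 0

    def scan_prefix(self, tokens):
        """All stored keys extending `tokens`, via one contiguous range scan."""
        prefix = encode_key(tokens)
        i = bisect.bisect_left(self.keys, prefix)
        found = []
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            found.append(self.keys[i])
            i += 1
        return found


store = TinyPrefixStore()
store.put([1, 2], b"kv-for-1-2")
store.put([1, 2, 3], b"kv-for-1-2-3")
store.put([1, 300], b"kv-for-1-300")
store.put([2], b"kv-for-2")
matches = store.scan_prefix([1])
```

In this toy, a request for tokens [1, 2, 3, 4] would reuse the stored entry for [1, 2, 3], and every entry that begins with token 1 is found by a single range scan rather than by opening one file per object.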
Peer Reviews
Decision: Submitted to ICLR 2026
* The paper is the first to apply database LSM-tree designs to LLM KV-cache management, including a prefix-preserving key encoding and explicit key-value separation for tensor payloads.
* The paper is well-written and has a clear layered architecture of the system.
* The paper provides a concrete pseudo-API (Appendix B, Figure 6) that shows how probe and get_batch compose with SGLang’s RadixAttention, which helps reproducibility.
* The workloads are synthetic with ten staged hit-rate phases. Results would be more convincing with traces reflecting real-world multi-turn chat or retrieval-augmented traffic.
* Beyond SGLang(memory) and SGLang(file), there are no comparisons to other disk-backed KV-cache systems (e.g., recent multi-tier prefix stores) nor ablations teasing apart contributions of prefix-encoding, key-value separation, dynamic compaction, and runtime services.
S1. The paper presents an interesting integration of LSM-tree storage architectures (e.g., RocksDB) into LLM serving (e.g., SGLang). This cross-domain design bridges disk storage and AI systems for improved scalability.
S2. The experimental results appear to be good.
This paper raises a lot of questions about the design choices and evaluation scope of SGLANG-LSM despite its strong technical contributions.
W1. Since the LSM-tree design is primarily optimized for high write throughput, it is unclear why it was chosen over read-optimized alternatives such as B-trees, especially when the main goal of LLM serving is to accelerate cache retrieval. The authors should better motivate why enabling high write throughput is critical for this workload.
W2. The paper p
1. The paper correctly identifies a practical scalability problem: existing file-per-object KV-cache backends create millions of small files, leading to metadata overhead and poor I/O locality.
2. The application of LSM-tree storage to KV-cache management is relatively new.
3. The reported 24% TTFT reduction represents a measurable and relevant performance improvement for latency-sensitive LLM inference.
1. Questionable necessity of LSM for this workload. Modern KV-cache management systems (e.g., LMCache (Cheng et al., 2025)) already compute hash-chained prefix identifiers and perform O(1) lookups over content-addressed append logs. Since most KV-cache objects are immutable and written once, the motivation for adopting a full LSM-tree (with compaction and sorted order maintenance) is not well justified.
2. Incomplete evaluation details. The evaluation setup raises questions: the authors report h
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
