Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics
Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee

TL;DR
This paper introduces SISO, a semantic caching system for large language model serving that raises cache hit ratios by up to 1.71× and improves SLO attainment by moving beyond the heuristics of traditional caches.
Contribution
SISO employs centroid-based caching, locality-aware replacement, and dynamic thresholding to enhance efficiency and effectiveness in LLM serving systems.
Findings
Up to 1.71× higher cache hit ratios than state-of-the-art systems
Consistently stronger SLO attainment than state-of-the-art systems
Effective under diverse workloads
Abstract
Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. However, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. To address this gap, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71× higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
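The idea of centroid-based semantic caching can be illustrated with a minimal sketch. It is not SISO's implementation: the abstract gives no algorithmic detail, so the class name `CentroidCache`, the running-mean centroid update, the cosine-similarity metric, and the fixed threshold of 0.9 are all assumptions made for illustration. Locality-aware replacement and dynamic thresholding are omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CentroidCache:
    """Hypothetical semantic cache: one entry per cluster of similar queries."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # fixed here; SISO adjusts it dynamically
        self.entries = []           # each: {"centroid", "count", "response"}

    def _nearest(self, emb):
        # Linear scan for the most similar centroid (fine for a sketch).
        best, best_sim = None, -1.0
        for entry in self.entries:
            sim = cosine(emb, entry["centroid"])
            if sim > best_sim:
                best, best_sim = entry, sim
        return best, best_sim

    def lookup(self, emb):
        """Return the cached response on a semantic hit, else None."""
        best, sim = self._nearest(emb)
        if best is not None and sim >= self.threshold:
            return best["response"]
        return None

    def insert(self, emb, response):
        """Merge into the nearest centroid if close enough, else add an entry."""
        best, sim = self._nearest(emb)
        if best is not None and sim >= self.threshold:
            n = best["count"]
            # Running mean keeps one vector per cluster instead of one per query.
            best["centroid"] = [(c * n + x) / (n + 1)
                                for c, x in zip(best["centroid"], emb)]
            best["count"] = n + 1
        else:
            self.entries.append(
                {"centroid": list(emb), "count": 1, "response": response})


cache = CentroidCache(threshold=0.9)
cache.insert([1.0, 0.0], "answer A")
cache.insert([0.9, 0.1], "answer A (paraphrase)")  # merged into the first entry
cache.insert([0.0, 1.0], "answer B")               # dissimilar: new entry
hit = cache.lookup([0.99, 0.05])
miss = cache.lookup([0.7, 0.7])
```

In this sketch, merging similar queries into a single centroid is what trades memory for coverage: the two paraphrases share one entry, and a nearby query still hits it. A real system would embed queries with a sentence encoder and use an approximate nearest-neighbor index rather than a linear scan.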
