Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics

Jungwoo Kim; Minsang Kim; Jaeheon Lee; Chanwoo Moon; Heejin Kim; Taeho Hwang; Woosuk Chung; Yeseong Kim; Sungjin Lee

arXiv:2508.18736 · cs.DB · August 27, 2025

TL;DR

This paper introduces SISO, a semantic caching system for large language model serving. It combines centroid-based caching, locality-aware replacement, and dynamic thresholding, and reports up to 1.71× higher cache hit ratios and stronger service level objective (SLO) attainment than state-of-the-art systems.

Contribution

SISO employs centroid-based caching, locality-aware replacement, and dynamic thresholding to enhance efficiency and effectiveness in LLM serving systems.
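To make the idea of centroid-based caching concrete, the following is a minimal sketch of a semantic cache that keeps one entry per group of similar queries rather than one entry per query. It is not SISO's implementation: the class name `CentroidCache`, the running-mean centroid update, the fixed similarity threshold, and the choice to keep the first response seen for a group are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CentroidCache:
    """Toy semantic cache (hypothetical): one entry per cluster of similar queries."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed fixed similarity cutoff
        self.entries = []           # each entry: {"centroid", "count", "response"}

    def lookup(self, emb):
        """Return the response of the most similar centroid, or None on a miss."""
        best, best_sim = None, -1.0
        for e in self.entries:
            sim = cosine(emb, e["centroid"])
            if sim > best_sim:
                best, best_sim = e, sim
        if best is not None and best_sim >= self.threshold:
            return best["response"]
        return None

    def insert(self, emb, response):
        """Fold the embedding into the first close-enough centroid, else add an entry."""
        for e in self.entries:
            if cosine(emb, e["centroid"]) >= self.threshold:
                n = e["count"]
                # Running mean of member embeddings; the stored response is kept as-is.
                e["centroid"] = [(c * n + x) / (n + 1) for c, x in zip(e["centroid"], emb)]
                e["count"] = n + 1
                return
        self.entries.append({"centroid": list(emb), "count": 1, "response": response})
```

In this sketch, two near-duplicate queries occupy a single cache entry, which is the sense in which a centroid can cover more queries per unit of memory than per-query storage.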

Findings

01. Up to 1.71× higher cache hit ratios

02. Consistently better SLO attainment

03. Effective under diverse workloads

Abstract

Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. Building on this, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71× higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
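The accuracy and latency trade-off behind dynamic thresholding can be illustrated with a small feedback rule. This is only a sketch under stated assumptions, not the policy described in the paper: the function name `adjust_threshold`, the SLO target, the step size, and the clamping bounds are all hypothetical. The assumed intuition is that a lower similarity threshold admits more cache hits (lower latency, less exact matches), while a higher one does the opposite.

```python
def adjust_threshold(threshold, slo_attainment, target=0.95, step=0.01,
                     lower=0.70, upper=0.95):
    """Hypothetical feedback rule for a semantic-cache similarity threshold.

    threshold:       current similarity cutoff for declaring a cache hit
    slo_attainment:  fraction of recent requests that met their SLO (0.0 to 1.0)
    target, step, lower, upper: illustrative tuning constants (assumptions)
    """
    if slo_attainment < target:
        # Under pressure: loosen matching so more requests are served from cache.
        threshold -= step
    else:
        # SLOs are being met: tighten matching to favor answer accuracy.
        threshold += step
    # Keep the cutoff inside the allowed range.
    return min(upper, max(lower, threshold))
```

Called periodically with the measured SLO attainment, the rule nudges the cutoff down under load and back up when there is headroom; a real system would need to choose the constants and the measurement window empirically.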
