Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Chen Wang; Xunzhuo Liu; Yue Zhu; Alaa Youssef; Priya Nagpurkar; Huamin Chen

arXiv:2510.26835 · cs.DB · November 3, 2025

Open Access

TL;DR

This paper introduces a category-aware semantic caching system for heterogeneous LLM workloads, optimizing cache policies based on query characteristics to improve hit rates and reduce search costs.

Contribution

It proposes a hybrid architecture with adaptive, category-specific caching policies and load-aware adjustments, significantly enhancing cache efficiency for diverse LLM query types.
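
As a rough illustration of what category-specific policies with a load-aware adjustment might look like, here is a minimal Python sketch. It is not the paper's implementation: the names `CategoryPolicy`, `POLICIES`, `effective_policy`, and `is_hit`, all numeric thresholds and TTLs, and the rule of relaxing the policy once model load exceeds 0.8 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryPolicy:
    similarity_threshold: float  # minimum similarity for a cached entry to count as a match
    ttl_seconds: float           # maximum age at which a cached answer is still served


# Hypothetical per-category values; none of these numbers come from the paper.
POLICIES = {
    "code": CategoryPolicy(similarity_threshold=0.92, ttl_seconds=30 * 24 * 3600),
    "conversation": CategoryPolicy(similarity_threshold=0.80, ttl_seconds=24 * 3600),
    "stock": CategoryPolicy(similarity_threshold=0.90, ttl_seconds=300),
}
DEFAULT_POLICY = CategoryPolicy(similarity_threshold=0.88, ttl_seconds=3600)


def effective_policy(category: str, model_load: float) -> CategoryPolicy:
    """Return the category's policy, relaxed when the backing model is overloaded.

    Assumed rule: for model_load above 0.8, lower the threshold by up to 0.03
    and extend the TTL by up to 50%, scaling linearly until load reaches 1.0.
    """
    base = POLICIES.get(category, DEFAULT_POLICY)
    if model_load <= 0.8:
        return base
    pressure = min((model_load - 0.8) / 0.2, 1.0)
    return CategoryPolicy(
        similarity_threshold=base.similarity_threshold - 0.03 * pressure,
        ttl_seconds=base.ttl_seconds * (1.0 + 0.5 * pressure),
    )


def is_hit(category: str, similarity: float, age_seconds: float,
           model_load: float = 0.0) -> bool:
    """Decide whether a candidate cache entry may be served for this category."""
    policy = effective_policy(category, model_load)
    return (similarity >= policy.similarity_threshold
            and age_seconds <= policy.ttl_seconds)
```

Under these assumed values, a code query with similarity 0.90 misses at normal load but is served from cache when the target model is fully loaded, which is the kind of behavior that would shift traffic away from an overloaded model.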

Findings

01 Reduced miss cost from 30ms to 2ms

02 Made low-hit-rate categories economically viable

03 Decreased traffic to overloaded models by 9-17%

Abstract

LLM serving systems process heterogeneous query workloads where different categories exhibit different characteristics. Code queries cluster densely in embedding space while conversational queries distribute sparsely. Content staleness varies from minutes (stock data) to months (code patterns). Query repetition patterns range from power-law (code) to uniform (conversation), producing long-tail cache hit rate distributions: high-repetition categories achieve 40-60% hit rates while low-repetition or volatile categories achieve 5-15% hit rates. Vector databases must exclude the long tail because remote search costs (30ms) require 15-20% hit rates to break even, leaving 20-30% of production traffic uncached. Uniform cache policies compound this problem: fixed thresholds cause false positives in dense spaces and miss valid paraphrases in sparse spaces; fixed TTLs waste memory or serve stale…
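
The break-even figure in the abstract can be checked with a simple cost model, sketched below in Python. The model is an assumption, not taken from the paper: every lookup pays the search cost, each hit saves a fixed amount of inference time, and caching pays off when the expected saving covers the lookup cost. The function names and the 200 ms saving per hit are hypothetical; that value is back-derived so that a 30ms lookup breaks even at a 15% hit rate.

```python
def break_even_hit_rate(lookup_cost_ms: float, saving_per_hit_ms: float) -> float:
    """Minimum hit rate h at which h * saving_per_hit_ms covers lookup_cost_ms."""
    if saving_per_hit_ms <= 0:
        raise ValueError("saving_per_hit_ms must be positive")
    return lookup_cost_ms / saving_per_hit_ms


def is_viable(hit_rate: float, lookup_cost_ms: float, saving_per_hit_ms: float) -> bool:
    """True when caching a category saves at least as much time as it costs."""
    return hit_rate >= break_even_hit_rate(lookup_cost_ms, saving_per_hit_ms)


# Assumed saving per hit, chosen so that a 30 ms lookup breaks even at 15%.
SAVING_PER_HIT_MS = 200.0

remote = break_even_hit_rate(30.0, SAVING_PER_HIT_MS)  # remote vector search
local = break_even_hit_rate(2.0, SAVING_PER_HIT_MS)    # 2 ms miss cost
print(f"break-even at 30 ms: {remote:.1%}, at 2 ms: {local:.1%}")
```

Under this model, a category with a 5% hit rate is not worth caching when each lookup costs 30ms but clears the break-even point comfortably when the miss cost drops to 2ms.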

Taxonomy

Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Cloud Computing and Resource Management