Category-Aware Semantic Caching for Heterogeneous LLM Workloads
Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen

TL;DR
This paper introduces a category-aware semantic caching system for heterogeneous LLM workloads, optimizing cache policies based on query characteristics to improve hit rates and reduce search costs.
Contribution
It proposes a hybrid architecture with adaptive, category-specific caching policies and load-aware adjustments, significantly enhancing cache efficiency for diverse LLM query types.
Findings
Reduced miss cost from 30ms to 2ms
Made low-hit-rate categories economically viable
Decreased traffic to overloaded models by 9-17%
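The link between a cheaper miss and the viability of low-hit-rate categories can be illustrated with a small break-even calculation. This is a sketch under a simplified cost model that is an assumption here, not taken from the paper: every query pays the cache lookup cost, and each hit saves a fixed amount of work. The function name and the `saved_ms` values (150 and 200) are hypothetical; they are back-derived from the 15-20% break-even at 30ms quoted in the abstract.

```python
def break_even_hit_rate(lookup_ms: float, saved_ms: float) -> float:
    """Hit rate h at which h * saved_ms equals the lookup cost paid on every query.

    Assumed model (not from the paper): net benefit per query = h * saved_ms - lookup_ms.
    """
    if saved_ms <= 0:
        raise ValueError("saved_ms must be positive")
    return lookup_ms / saved_ms


# Hypothetical per-hit savings, chosen so a 30ms lookup breaks even at 15-20%.
for saved_ms in (200.0, 150.0):
    remote = break_even_hit_rate(30.0, saved_ms)  # remote vector search
    local = break_even_hit_rate(2.0, saved_ms)    # reduced miss cost
    print(f"saved={saved_ms:.0f}ms  remote={remote:.1%}  local={local:.1%}")
```

Under these assumptions, a 2ms miss cost puts the break-even hit rate at roughly 1%, below the 5-15% range the abstract reports for low-repetition or volatile categories.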
Abstract
LLM serving systems process heterogeneous query workloads where different categories exhibit different characteristics. Code queries cluster densely in embedding space while conversational queries distribute sparsely. Content staleness varies from minutes (stock data) to months (code patterns). Query repetition patterns range from power-law (code) to uniform (conversation), producing long-tail cache hit rate distributions: high-repetition categories achieve 40-60% hit rates while low-repetition or volatile categories achieve 5-15% hit rates. Vector databases must exclude the long tail because remote search costs (30ms) require 15-20% hit rates to break even, leaving 20-30% of production traffic uncached. Uniform cache policies compound this problem: fixed thresholds cause false positives in dense spaces and miss valid paraphrases in sparse spaces; fixed TTLs waste memory or serve stale…
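The contrast between uniform and category-specific policies can be sketched as a small in-memory cache in which each category carries its own similarity threshold and TTL. This is a minimal illustration, not the authors' implementation: the class names, category labels, and every numeric value are hypothetical, chosen only to reflect the abstract's qualitative points (a stricter threshold for densely clustered code queries, a looser one for sparse conversational queries, a TTL of minutes for stock data and months for code patterns).

```python
import math
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryPolicy:
    similarity_threshold: float  # minimum cosine similarity to count as a hit
    ttl_seconds: float           # how long an entry stays valid


# Hypothetical values for illustration only; none come from the paper.
POLICIES = {
    "code": CategoryPolicy(0.92, 30 * 24 * 3600),     # dense space, slow staleness
    "conversation": CategoryPolicy(0.80, 24 * 3600),  # sparse space
    "stock": CategoryPolicy(0.90, 5 * 60),            # stale within minutes
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CategoryAwareCache:
    def __init__(self, policies, clock=time.monotonic):
        self.policies = policies
        self.clock = clock
        self.entries = {name: [] for name in policies}

    def put(self, category, embedding, response):
        self.entries[category].append((embedding, response, self.clock()))

    def get(self, category, embedding):
        policy = self.policies[category]
        now = self.clock()
        # Evict entries older than this category's TTL.
        live = [e for e in self.entries[category]
                if now - e[2] <= policy.ttl_seconds]
        self.entries[category] = live
        # Linear scan for the nearest entry; a real system would use an index.
        best = max(live, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= policy.similarity_threshold:
            return best[1]
        return None
```

Keeping the policy in a per-category table means the threshold and TTL can be tuned for one category without affecting the others, which is the property a single global setting lacks.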
Taxonomy
Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Cloud Computing and Resource Management
