RAC: Relation-Aware Cache Replacement for Large Language Models
Yuchong Wu, Zihuan Xu, Wangze Ni, Peng Cheng, Lei Chen, Xuemin Lin, Heng Tao Shen, Kui Ren

TL;DR
This paper introduces RAC, a relation-aware cache replacement strategy for large language models that leverages semantic relations to improve cache hit ratios under capacity constraints.
Contribution
RAC is a novel online eviction policy that uses semantic relations among requests, outperforming existing methods in LLM caching scenarios.
Findings
RAC improves cache hit ratio by 20-30% over baselines.
RAC effectively handles long reuse distances and sparse recurrence.
Extensive evaluations confirm RAC's robustness across workloads.
Abstract
The scaling of Large Language Model (LLM) services faces significant cost and latency challenges, making effective caching under tight capacity crucial. Existing cache replacement policies, from heuristics to learning-based methods, predominantly rely on limited-window statistics such as recency and frequency. We show these signals are not robust for real-world LLM workloads, which exhibit long reuse distances and sparse local recurrence. To address these limitations, we propose Relation-Aware Cache (RAC), an online eviction strategy that leverages semantic relations among requests to guide eviction decisions. RAC synthesizes two relation-aware signals: (1) Topical Prevalence, which aggregates access evidence at the topic level to capture long-horizon reuse; and (2) Structural Importance, which leverages local intra-topic dependency structure to discriminate entries by their future…
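The two signals named in the abstract can be illustrated with a toy eviction policy. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `RelationAwareCacheSketch`, the mixing weight `alpha`, the caller-supplied topic labels, and the `parent` link used as a stand-in for intra-topic dependency are all hypothetical. The abstract does not specify how RAC assigns topics or computes either score.

```python
from collections import defaultdict


class RelationAwareCacheSketch:
    """Toy cache that evicts the entry with the lowest relation-aware score."""

    def __init__(self, capacity, alpha=0.5):
        self.capacity = capacity
        self.alpha = alpha                    # hypothetical weight between the two signals
        self.store = {}                       # key -> (topic, value)
        self.topic_hits = defaultdict(int)    # topic -> accesses observed so far
        self.dependents = defaultdict(set)    # key -> keys declared to build on it

    def _score(self, key):
        topic, _ = self.store[key]
        # Assumed proxy for Topical Prevalence: share of all accesses on this topic.
        total = sum(self.topic_hits.values()) or 1
        prevalence = self.topic_hits[topic] / total
        # Assumed proxy for Structural Importance: fraction of cached same-topic
        # entries that depend on this one.
        same_topic = sum(1 for t, _ in self.store.values() if t == topic)
        live_deps = sum(1 for d in self.dependents[key] if d in self.store)
        importance = live_deps / max(same_topic - 1, 1)
        return self.alpha * prevalence + (1 - self.alpha) * importance

    def get(self, key, topic):
        # Evidence is aggregated per topic on every request, hit or miss.
        self.topic_hits[topic] += 1
        entry = self.store.get(key)
        return entry[1] if entry is not None else None

    def put(self, key, topic, value, parent=None):
        if parent is not None:
            self.dependents[parent].add(key)
        if key not in self.store and len(self.store) >= self.capacity:
            victim = min(self.store, key=self._score)
            del self.store[victim]
        self.store[key] = (topic, value)


# Usage: topic "A" recurs over a long horizon, topic "B" appears once.
cache = RelationAwareCacheSketch(capacity=2)
cache.get("a1", "A")
cache.put("a1", "A", "answer-a1")
cache.get("a1", "A")
cache.get("b1", "B")
cache.put("b1", "B", "answer-b1")
cache.get("a2", "A")
cache.put("a2", "A", "answer-a2", parent="a1")  # cache is full, one entry is evicted
```

In this trace a recency-based policy such as LRU would evict `a1`, because `b1` was touched more recently. The sketch instead evicts `b1`, since topic "A" has accumulated more access evidence than topic "B".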
Taxonomy
Topics: Cloud Computing and Resource Management · Caching and Content Delivery · Big Data and Digital Economy
