Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C.S. Lui, Wei Chen, Carlee Joe-Wong

TL;DR
This paper introduces a learning-based framework for semantic caching in large language model serving. It optimizes cache eviction when query and cost distributions are unknown, with the goal of reducing inference cost.
Contribution
The paper develops a principled, theoretically grounded approach to semantic cache eviction. It includes offline and online algorithms with proven guarantees, and it accounts for real-world uncertainty such as unknown query arrival probabilities and serving costs.
Findings
The proposed algorithms achieve state-of-the-art performance in synthetic tests.
The proposed methods are proven to be efficient and are adaptable.
The methods handle unknown query and cost distributions effectively.
Abstract
Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical…
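To make the trade-off in the abstract concrete, here is a minimal Python sketch of a semantic cache lookup. It is not the paper's algorithm: the names `SemanticCache`, `serve`, `call_llm`, the similarity threshold, and the use of `1 - cosine similarity` as a stand-in for mismatch cost are all illustrative assumptions, and eviction is deliberately left out. It only shows why a semantic hit avoids the LLM call but pays a mismatch cost instead.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache keyed by query embeddings (no eviction in this sketch)."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed minimum similarity for a hit
        self.entries: List[Tuple[Vector, str]] = []

    def insert(self, query_vec: Vector, response: str) -> None:
        self.entries.append((query_vec, response))

    def lookup(self, query_vec: Vector) -> Optional[Tuple[str, float]]:
        """Return (response, mismatch_cost) for the closest entry, or None on a miss."""
        best_response: Optional[str] = None
        best_sim = -1.0
        for vec, response in self.entries:
            sim = cosine_similarity(query_vec, vec)
            if sim > best_sim:
                best_sim, best_response = sim, response
        if best_response is not None and best_sim >= self.threshold:
            # Assumption: mismatch cost is modeled as 1 - similarity.
            return best_response, 1.0 - best_sim
        return None


def serve(
    cache: SemanticCache,
    query_vec: Vector,
    call_llm: Callable[[Vector], str],
    llm_cost: float,
) -> Tuple[str, float]:
    """Serve a query; return (response, cost incurred for this query)."""
    hit = cache.lookup(query_vec)
    if hit is not None:
        return hit  # pay only the mismatch cost
    response = call_llm(query_vec)  # pay the full serving cost
    cache.insert(query_vec, response)
    return response, llm_cost
```

An exact-match cache would treat the near-duplicate query `[0.99, 0.1]` as a miss against a cached `[1.0, 0.0]`; the sketch above serves it from the cache at a small mismatch cost. Deciding which entries to keep when the cache is full, given unknown arrival probabilities and serving costs, is the eviction problem the paper addresses.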
Taxonomy
Topics: Caching and Content Delivery · Information Retrieval and Search Behavior · Web Data Mining and Analysis
