Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

Xutong Liu; Baran Atalar; Xiangxiang Dai; Jinhang Zuo; Siwei Wang; John C.S. Lui; Wei Chen; Carlee Joe-Wong

arXiv:2508.07675 · cs.LG · February 16, 2026

Open Access

TL;DR

This paper introduces a learning-based framework for semantic caching in large language model serving, optimizing cache eviction strategies under uncertainty to reduce inference costs effectively.

Contribution

The paper develops a principled, theoretically grounded approach to semantic cache eviction, including offline and online algorithms with proven guarantees, for settings where query arrival probabilities and serving costs are unknown.

Findings

01. Algorithms achieve state-of-the-art performance in synthetic tests.

02. Proven efficiency and adaptability of the proposed methods.

03. Effective handling of unknown query and cost distributions.

Abstract

Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical…
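
To make the idea of semantic caching concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the class name `SemanticCache`, the cosine-similarity measure, the `threshold` value, and the fewest-hits eviction rule are all assumptions chosen for illustration. The paper's eviction methods additionally account for mismatch costs and learned arrival probabilities, which this sketch does not model.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticCache:
    """Toy semantic cache (hypothetical, for illustration only)."""
    capacity: int
    threshold: float = 0.9  # assumed similarity cutoff for a cache hit
    entries: list = field(default_factory=list)  # each entry: [embedding, response, hits]

    def lookup(self, embedding):
        """Return the most similar cached response, or None on a miss."""
        best, best_sim = None, self.threshold
        for entry in self.entries:
            sim = cosine_similarity(embedding, entry[0])
            if sim >= best_sim:
                best, best_sim = entry, sim
        if best is None:
            return None  # miss: the caller would query the LLM
        best[2] += 1
        return best[1]

    def insert(self, embedding, response):
        """Store a response, evicting the entry with the fewest hits if full."""
        if len(self.entries) >= self.capacity:
            # Placeholder eviction rule, not the paper's cost-aware policy.
            victim = min(range(len(self.entries)), key=lambda i: self.entries[i][2])
            self.entries.pop(victim)
        self.entries.append([list(embedding), response, 0])


cache = SemanticCache(capacity=2)
cache.insert([1.0, 0.0], "answer A")
cache.insert([0.0, 1.0], "answer B")
near_hit = cache.lookup([0.99, 0.05])  # close to the first embedding
far_miss = cache.lookup([1.0, 1.0])    # not close enough to either entry
```

A near-duplicate query is served from the cache even though it is not an exact match, while a query that is only moderately similar to the cached entries falls through to the model. Lowering `threshold` raises the hit rate at the price of more mismatched responses, which is the trade-off the abstract refers to as mismatch cost.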

Taxonomy

Topics: Caching and Content Delivery · Information Retrieval and Search Behavior · Web Data Mining and Analysis