vCache: Verified Semantic Prompt Caching

Luis Gaspar Schroeder; Aditya Desai; Alejandro Cuadron; Kyle Chu; Shu Liu; Mark Zhao; Stephan Krusche; Alfons Kemper; Matei Zaharia; Joseph E. Gonzalez

arXiv:2502.03771·cs.LG·February 24, 2026

Open Access · 3 Reviews

TL;DR

vCache introduces a verified semantic caching system for large language models that guarantees user-defined error bounds and significantly improves cache hit rates by using an online learning algorithm to adapt similarity thresholds.

Contribution

It presents the first verified semantic cache with formal error guarantees, employing an online learning approach to optimize similarity thresholds for improved performance.

Findings

01

vCache outperforms static-threshold baselines with up to 12.5× higher cache hit rates.

02

vCache achieves up to 26× lower error rates compared to existing methods.

03

The system reliably meets user-defined error bounds across diverse benchmarks.

Abstract

Semantic caches return cached responses for semantically similar prompts to reduce LLM inference latency and cost. They embed cached prompts and store them alongside their response in a vector database. Embedding similarity metrics assign a numerical score to quantify the similarity between a request and its nearest neighbor prompt from the cache. Existing systems use the same static similarity threshold across all requests to determine whether two prompts can share similar responses. However, we observe that static thresholds do not give formal correctness guarantees, result in unexpected error rates, and lead to suboptimal cache hit rates. This paper proposes vCache, the first verified semantic cache with user-defined error rate guarantees for predictable performance. It employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable…
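The static-threshold design that the abstract attributes to existing systems can be sketched roughly as follows. This is a minimal illustration, not vCache or any particular library: the `embed` function is a toy bag-of-words stand-in for a real embedding model, a plain list stands in for the vector database, and the names `StaticThresholdCache`, `lookup`, and `insert` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StaticThresholdCache:
    """Hypothetical semantic cache using one global similarity threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs; stand-in for a vector DB

    def insert(self, prompt, response):
        self.entries.append((embed(prompt), response))

    def lookup(self, prompt):
        """Return (response, score) on a hit, or (None, score) on a miss."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, best_score
        return None, best_score


loose = StaticThresholdCache(threshold=0.8)
loose.insert("what is the capital of france", "Paris")
# A different question that shares five of six words clears the 0.8 threshold,
# so the cache serves the wrong answer.
wrong_hit, score = loose.lookup("what is the capital of germany")

strict = StaticThresholdCache(threshold=0.9)
strict.insert("what is the capital of france", "Paris")
# A stricter global threshold rejects the same query and falls through to a miss.
miss, _ = strict.lookup("what is the capital of germany")
```

In this toy setup the same pair of prompts is an incorrect hit at a threshold of 0.8 and a miss at 0.9, which illustrates why a single global threshold gives no control over the resulting error rate.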

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. This study improves algorithmically on existing threshold-based retrieval methods such as GPTCache using a novel approach. It theoretically guarantees a desired error rate and, by introducing a verified semantic cache, demonstrates improved performance across several metrics.
2. As an online learning algorithm, the proposed method achieves strong results without fine-tuning the embedding model, thus requiring no additional training.
3. To validate the effectiveness of the proposed methodo…

Weaknesses

1. Since vCache is still a semantic caching technique, its effectiveness is insufficiently evaluated with respect to the capability and performance of the underlying embedding model (e.g., BERT). A comparative analysis using various embedding models is needed to substantiate the "Model-Agnostic" claim, which currently lacks sufficient analysis.
2. While the paper discusses the trade-off between accuracy and cost, the benchmark results show a trade-off between Cache Hit Rate and Error R…

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper addresses a practical challenge of semantic prompt caching
- The proposed method is well-motivated

Weaknesses

- The writing is a bit repetitive and could be streamlined

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- This paper tackles an important topic, GPT-cache, and provides unique perspectives on how to optimize it. The static threshold is a fundamental problem that prevents GPT-cache from better selecting cached responses.
- The evaluation is holistic and abundant. The authors also present large-scale benchmarks to help research in this field.
- The authors also present good theoretical guarantees showing the effectiveness of this method,

Weaknesses

- Dataset generation neglects the no-match scenario. I checked the appendix on how the data is generated. For instance, for SemCacheLMArena, 1 to 23 similar prompts are generated. In reality, there could be many prompts for which no similar answer can be fetched directly from the cache. I would suggest adding many prompts for which no similar variants are included.
- Though the experiments regarding error rate are abundant, more analysis of latencies and throughputs should be a…

Taxonomy

Topics: Caching and Content Delivery · Service-Oriented Architecture and Web Services · Cognitive Computing and Networks