Asynchronous Verified Semantic Caching for Tiered LLM Architectures

Asmit Kumar Singh; Haozhe Wang; Laxmi Naga Santosh Attaluri; Tak Chiam; Weihua Zhu

arXiv:2602.13165 · cs.IR · March 16, 2026

Open Access

TL;DR

Krites is an asynchronous caching policy for tiered LLM architectures. It uses LLM judgments to expand static cache coverage without adding critical-path latency, raising the static cache hit rate by up to 3.9 times in simulations.

Contribution

The paper introduces an asynchronous, LLM-judged verification mechanism that makes the static cache more effective in tiered LLM systems without affecting response latency.

Findings

01. Increases the static cache hit rate by up to 3.9 times in simulations.

02. Leaves critical-path latency unchanged.

03. Improves reuse of static responses in conversational and search workloads.

Abstract

Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline-vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the…
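The baseline the abstract describes, a static tier backed by a dynamic tier with one similarity threshold governing both, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `Tier`, `lookup`, and `fake_llm`, the use of cosine similarity, and the toy two-dimensional embeddings are all assumptions made for the example. It covers only the threshold-based serving path and does not model the asynchronous judging step.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Tier:
    """Hypothetical cache tier: a flat list of (embedding, response) pairs."""
    entries: list = field(default_factory=list)

    def nearest(self, query):
        """Return (similarity, response) of the closest entry, or (-1.0, None)."""
        best_sim, best_resp = -1.0, None
        for emb, resp in self.entries:
            sim = cosine(query, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        return best_sim, best_resp


def lookup(query, static, dynamic, threshold, llm):
    """Tiered lookup with one threshold for both tiers (assumed baseline)."""
    sim, resp = static.nearest(query)
    if resp is not None and sim >= threshold:
        return "static", resp
    sim, resp = dynamic.nearest(query)
    if resp is not None and sim >= threshold:
        return "dynamic", resp
    resp = llm(query)                      # miss: call the model
    dynamic.entries.append((query, resp))  # populate the dynamic tier online
    return "miss", resp


def fake_llm(query):
    """Stand-in for a real model call."""
    return "generated"


static = Tier([([1.0, 0.0], "curated answer")])
near_miss = [0.8, 0.6]  # cosine similarity 0.8 to the static entry

# Conservative threshold: the near neighbor is not reused.
conservative = lookup(near_miss, static, Tier(), 0.9, fake_llm)
# Aggressive threshold: the same neighbor is served from the static tier.
aggressive = lookup(near_miss, static, Tier(), 0.75, fake_llm)
```

With the threshold at 0.9 the query falls through to the model, and with 0.75 the curated response is served. This is the tradeoff the abstract points to: the same near neighbor is either a missed reuse opportunity or a potentially incorrect hit, depending only on where the single threshold sits.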

Taxonomy

Topics: Machine Learning in Materials Science · Caching and Content Delivery · Scientific Computing and Data Management