Generative Caching for Structurally Similar Prompts and Responses
Sarthak Chakraborty, Suman Nath, Xuchao Zhang, Chetan Bansal, Indranil Gupta

TL;DR
This paper introduces a generative caching method that exploits structural similarities among prompts to reuse responses, raising cache hit rates and reducing latency in large language model workflows.
Contribution
It presents a novel generative cache that identifies reusable response patterns across structurally similar prompts, improving caching effectiveness over exact or semantic matching.
Findings
Achieves an 83% cache hit rate, with minimal incorrect hits on datasets without prompt repetition.
Improves cache hit rate by approximately 20% in agentic workflows.
Reduces end-to-end execution latency by around 34%.
Abstract
Large Language Models (LLMs) are increasingly being used to plan, reason, and execute tasks across diverse scenarios. In use cases like repeatable workflows and agentic settings, prompts are often reused with minor variations while having a similar structure for recurring tasks. This opens up opportunities for caching. However, exact prompt matching fails on such structurally similar prompts, while semantic caching may produce incorrect responses by ignoring critical differences. To address this, we introduce a generative cache that produces variation-aware responses for structurally similar prompts. It identifies reusable response patterns across similar prompt structures and synthesizes customized outputs for new requests. We show that it achieves an 83% cache hit rate, while having minimal incorrect hits on datasets without prompt repetition. In…
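To make the idea of reusing a response pattern across structurally similar prompts concrete, here is a minimal sketch in Python. It is not the authors' implementation, whose mechanism this excerpt does not describe. It assumes a toy setting in which two cached prompts differ in exactly one whitespace-separated token and the same substitution explains the difference between their responses. The class name `ToyGenerativeCache` and its methods are hypothetical.

```python
import re
from typing import Optional


class ToyGenerativeCache:
    """Illustrative toy cache (hypothetical, not the paper's method).

    Falls back from exact matching to a learned single-slot template.
    """

    def __init__(self) -> None:
        self.exact = {}      # prompt -> response
        self.templates = []  # (compiled prompt regex, example response, slot value)

    def put(self, prompt: str, response: str) -> None:
        # Compare the new pair against every stored pair to find a shared structure.
        for old_prompt, old_response in self.exact.items():
            self._try_learn(old_prompt, old_response, prompt, response)
        self.exact[prompt] = response

    def _try_learn(self, p1: str, r1: str, p2: str, r2: str) -> None:
        t1, t2 = p1.split(), p2.split()
        if len(t1) != len(t2):
            return
        diff = [i for i in range(len(t1)) if t1[i] != t2[i]]
        if len(diff) != 1:
            return  # assumption: only one varying token is supported
        i = diff[0]
        v1, v2 = t1[i], t2[i]
        # Generalize only if the same substitution explains the response change.
        if v1 not in r1 or r1.replace(v1, v2) != r2:
            return
        parts = [re.escape(tok) for tok in t1]
        parts[i] = r"(\S+)"  # the varying token becomes a capture group
        self.templates.append((re.compile(r"\s+".join(parts)), r1, v1))

    def get(self, prompt: str) -> Optional[str]:
        if prompt in self.exact:
            return self.exact[prompt]  # exact-match hit
        for pattern, response, slot in self.templates:
            m = pattern.fullmatch(prompt.strip())
            if m:
                # Generative hit: synthesize a response for the new slot value.
                return response.replace(slot, m.group(1))
        return None  # miss: the caller would query the LLM


cache = ToyGenerativeCache()
cache.put("Cancel order 1001 for the customer", "cancel_order(order_id=1001)")
cache.put("Cancel order 2002 for the customer", "cancel_order(order_id=2002)")
hit = cache.get("Cancel order 3003 for the customer")
miss = cache.get("Refund order 3003 for the customer")
```

Under these assumptions, the third prompt is answered from the cache with the order ID substituted, even though that exact prompt was never stored. The "Refund" prompt is a miss because its structure differs from the learned template. A similarity-only cache could instead match it to a "Cancel" entry, which is the risk the abstract attributes to semantic caching.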
Taxonomy
Topics: Business Process Modeling and Analysis · Software System Performance and Reliability · Machine Learning in Materials Science
