InstCache: A Predictive Cache for LLM Serving
Longwei Zou, Yan Liu, Jiamu Kang, Tingfeng Liu, Jiangang Kong, Yangdong Deng

TL;DR
InstCache is a predictive caching system for LLM inference that uses an LLM to reorder instruction representations, raising the cache hit rate (a reported 2.3x over traditional methods) and cutting time per output token (by up to 42% and 50%).
Contribution
The paper presents InstCache, a novel predictive caching mechanism that uses LLMs to enhance instruction-level cache efficiency in LLM serving systems.
Findings
Achieves a 2.3x higher cache hit rate than traditional caching methods.
Reduces time per output token by up to 42% and 50% in two reported settings.
Demonstrates effectiveness on multiple datasets.
Abstract
The revolutionary capabilities of Large Language Models (LLMs) are attracting rapidly growing popularity and leading to soaring user requests to inference serving systems. Caching techniques, which leverage data reuse to reduce computation, offer opportunities to optimize the performance of LLM inference engines. On the one hand, the low-level key-value (KV) cache, which works at the token level, is widely adopted, although it incurs significant overhead as request volume grows. On the other hand, instruction-level caching, which stores full instruction-response pairs, is expected to play an increasingly crucial role. However, the high variability in the content and length of instructions makes it rare for identical instructions to recur within a short time window, which makes it difficult to cache instruction-response pairs effectively. To address this challenge, we propose InstCache, a…
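To make the instruction-level caching idea concrete, here is a minimal sketch of a conventional exact-match cache that stores full instruction-response pairs. It is not InstCache itself, whose predictive mechanism is not described in the excerpt above. The names `InstructionCache`, `serve`, and `fake_generate` are hypothetical and exist only for illustration, and `fake_generate` stands in for a real LLM call.

```python
from typing import Callable, Dict, List, Optional


class InstructionCache:
    """Exact-match cache: full instruction string -> full response string."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, instruction: str) -> Optional[str]:
        # A hit requires the instruction text to match character for character.
        response = self._store.get(instruction)
        if response is None:
            self.misses += 1
        else:
            self.hits += 1
        return response

    def put(self, instruction: str, response: str) -> None:
        self._store[instruction] = response

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def serve(instruction: str, cache: InstructionCache,
          generate: Callable[[str], str]) -> str:
    """Return a cached response if present; otherwise generate and store it."""
    cached = cache.get(instruction)
    if cached is not None:
        return cached
    response = generate(instruction)
    cache.put(instruction, response)
    return response


def fake_generate(instruction: str) -> str:
    # Hypothetical stand-in for an LLM inference call.
    return f"response to: {instruction}"


cache = InstructionCache()
requests: List[str] = [
    "What is a KV cache?",
    "What is a KV cache?",   # identical text: served from the cache
    "what is a kv cache",    # same intent, different text: cache miss
]
answers = [serve(r, cache, fake_generate) for r in requests]
print(f"hits={cache.hits} misses={cache.misses}")
```

The third request asks the same question in slightly different words and still misses, which illustrates the problem the abstract points to: when instructions vary in content and length, identical strings rarely recur, so an exact-match cache sees few hits.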
Taxonomy
Topics: Natural Language Processing Techniques
