IC-Cache: Efficient Large Language Model Serving via In-context Caching
Yifan Yu, Yu Gan, Nikhil Sarda, Lillian Tsai, Jiaming Shen, Yanqi Zhou, Arvind Krishnamurthy, Fan Lai, Henry M. Levy, David Culler

TL;DR
IC-Cache is a caching system that makes large language model serving more efficient: it reuses historical request-response pairs from larger models as in-context examples so that smaller models can imitate them, which reduces cost and latency.
Contribution
The paper introduces IC-Cache, a novel caching system that uses in-context examples to improve small LLMs' performance and efficiency at scale.
Findings
Increases serving throughput by up to 5.9x.
Reduces latency by up to 71%.
Maintains response quality despite efficiency improvements.
Abstract
Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 70% of user requests to LLMs have semantically similar counterparts, suggesting the potential for knowledge transfer among requests. However, naively caching and reusing past responses leads to a substantial drop in response quality. In this paper, we introduce IC-Cache, a caching system that enables live LLM capability augmentation to improve serving efficiency: by leveraging historical request-response pairs from larger models as in-context examples, IC-Cache empowers small LLMs to imitate and even exceed the compositional abilities (e.g., reasoning) of their larger counterparts, enabling selective offloading of requests to reduce cost and latency. Achieving this live augmentation at scale…
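The idea in the abstract, reusing past request-response pairs from a larger model as in-context examples for a smaller one, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`embed`, `select_examples`, `build_prompt`, `route`), the `min_sim` threshold, the prompt layout, and the sample history are all hypothetical, and a toy bag-of-words cosine similarity stands in for whatever semantic similarity measure the real system uses.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use semantic embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_examples(request, history, k=2, min_sim=0.3):
    """Return up to k past (request, response) pairs whose request is most similar."""
    query = embed(request)
    scored = [(cosine(query, embed(past_req)), past_req, past_resp)
              for past_req, past_resp in history]
    scored = [s for s in scored if s[0] >= min_sim]  # hypothetical relevance cutoff
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(req, resp) for _, req, resp in scored[:k]]


def build_prompt(request, examples):
    """Prepend the selected pairs as in-context examples before the new request."""
    parts = [f"Request: {req}\nResponse: {resp}" for req, resp in examples]
    parts.append(f"Request: {request}\nResponse:")
    return "\n\n".join(parts)


def route(request, history, k=2, min_sim=0.3):
    """Offload to the small model only when similar past examples exist (hypothetical policy)."""
    examples = select_examples(request, history, k, min_sim)
    if examples:
        return "small", build_prompt(request, examples)
    return "large", build_prompt(request, [])


# Illustrative history of (request, large-model response) pairs; made-up data.
HISTORY = [
    ("How do I reverse a list in Python?", "Use lst.reverse() in place, or reversed(lst)."),
    ("What is the capital of France?", "Paris."),
    ("How do I sort a list in Python?", "Use sorted(lst) or lst.sort()."),
]

model, prompt = route("How do I reverse a string in Python?", HISTORY)
print(model)
print(prompt)
```

In this sketch a request with no sufficiently similar history entry falls back to the large model, which mirrors the "selective offloading" described in the abstract only in spirit; how IC-Cache actually selects examples and routes requests is not stated in this excerpt.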
