SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models
Jiaxing Li, Chi Xu, Feng Wang, Isaac M von Riedemann, Cong Zhang, Jiangchuan Liu

TL;DR
This paper introduces SCALM, a semantic caching architecture for LLM chat services. Motivated by an analysis of real-world interaction data, it uses semantic analysis to improve cache efficiency and reduce token costs.
Contribution
SCALM is the first cache system to incorporate semantic analysis for LLM chat services, enhancing cache hit ratios and reducing token costs.
Findings
SCALM increases cache hit ratio by 63% on average.
SCALM reduces token costs by 77%.
Semantic caching outperforms existing solutions like GPTCache.
Abstract
Large Language Models (LLMs) have become increasingly popular, transforming a wide range of applications across various domains. However, the real-world effectiveness of their query cache systems has not been thoroughly investigated. In this work, we conduct the first analysis of real-world human-to-LLM interaction data and identify key challenges in existing caching solutions for LLM-based chat services. Our findings reveal that current caching methods fail to leverage semantic connections, leading to inefficient cache performance and extra token costs. To address these issues, we propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns. We also detail the implementations of the corresponding cache storage and eviction strategies. Our evaluations show that SCALM increases cache hit ratios and reduces…
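To make the idea of a semantic cache concrete, the following is a minimal Python sketch of a similarity-based lookup, as opposed to exact string matching. It is not SCALM's implementation: the class name SemanticCache, the bag-of-words embed function (a toy stand-in for a real sentence-embedding model), the 0.8 similarity threshold, and the LRU eviction policy are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, OrderedDict


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache: a hit is the most similar stored query above a threshold."""

    def __init__(self, capacity=128, threshold=0.8):
        self.capacity = capacity      # assumed maximum number of entries
        self.threshold = threshold    # assumed minimum similarity for a hit
        self.entries = OrderedDict()  # query text -> (vector, response)

    def get(self, query):
        vec = embed(query)
        best_key, best_sim = None, 0.0
        for key, (stored_vec, _) in self.entries.items():
            sim = cosine(vec, stored_vec)
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is not None and best_sim >= self.threshold:
            self.entries.move_to_end(best_key)  # mark as recently used
            return self.entries[best_key][1]
        return None  # miss: the caller would query the LLM and then call put()

    def put(self, query, response):
        self.entries[query] = (embed(query), response)
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

With this sketch, a query that differs from a stored one only in casing or punctuation still hits, while an unrelated query misses. A realistic system would replace embed with a learned embedding model and the linear scan with an approximate nearest-neighbor index.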
Taxonomy
Topics: Caching and Content Delivery · Recommender Systems and Techniques · Context-Aware Activity Recognition Systems
