An Ensemble Embedding Approach for Improving Semantic Caching Performance in LLM-based Systems
Shervin Ghaffari, Zohre Bahranifard, and Mohammad Akbari

TL;DR
This paper introduces an ensemble embedding method that combines multiple embedding models through a trained meta-encoder to improve semantic caching in LLM systems, reaching a 92% cache hit ratio on semantically equivalent queries and reducing computational cost.
Contribution
It proposes a novel ensemble embedding approach with a trained meta-encoder to better capture semantic similarities in LLM caching, outperforming single-model methods.
Findings
Achieved 92% cache hit ratio for semantically equivalent queries.
Maintained 85% accuracy in rejecting non-equivalent queries.
Significantly outperformed single-model approaches in semantic distinction.
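The two headline numbers above can be read as per-class rates over labeled query pairs. The sketch below shows one plausible way to compute them; the paper's exact metric definitions are not given in this summary, so the function name `cache_metrics` and the input layout are assumptions made for illustration.

```python
from typing import List, Tuple


def cache_metrics(results: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (hit ratio, rejection accuracy) from labeled lookups.

    Each item is (is_equivalent, cache_hit):
      is_equivalent: ground-truth label for the query pair (e.g. a QQP duplicate flag)
      cache_hit:     whether the cache served the stored response
    This layout is an assumption, not the paper's evaluation code.
    """
    equivalent = [hit for is_eq, hit in results if is_eq]
    non_equivalent = [hit for is_eq, hit in results if not is_eq]

    # Share of equivalent pairs that were served from the cache.
    hit_ratio = sum(equivalent) / len(equivalent) if equivalent else 0.0
    # Share of non-equivalent pairs that were correctly NOT served from the cache.
    rejection_accuracy = (
        sum(not hit for hit in non_equivalent) / len(non_equivalent)
        if non_equivalent
        else 0.0
    )
    return hit_ratio, rejection_accuracy


# Toy data: 4 equivalent pairs (3 hits), 2 non-equivalent pairs (1 correctly rejected).
toy = [(True, True), (True, True), (True, True), (True, False),
       (False, False), (False, True)]
print(cache_metrics(toy))  # (0.75, 0.5)
```

Reporting the two rates separately matters because a cache can trivially maximize one of them: a very low similarity threshold drives the hit ratio up while rejection accuracy collapses, and a very high threshold does the reverse.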
Abstract
Semantic caching enhances the efficiency of large language model (LLM) systems by identifying semantically similar queries, storing responses once, and serving them for subsequent equivalent requests. However, existing semantic caching frameworks rely on single embedding models for query representation, which limits their ability to capture the diverse semantic relationships present in real-world query distributions. This paper presents an ensemble embedding approach that combines multiple embedding models through a trained meta-encoder to improve semantic similarity detection in LLM caching systems. We evaluate our method using the Quora Question Pairs (QQP) dataset, measuring cache hit ratios, cache miss ratios, token savings, and response times. Our ensemble approach achieves a 92% cache hit ratio for semantically equivalent queries while maintaining an 85% accuracy in correctly…
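The lookup flow the abstract describes (embed a query with several models, combine the embeddings, and serve a stored response when similarity is high enough) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two hashing-based embedders stand in for real embedding models, a fixed weighted concatenation stands in for the trained meta-encoder, and the names `SemanticCache`, `ensemble_embed`, and the 0.8 threshold are all hypothetical.

```python
import hashlib
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def _normalise(vec: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _hashed_bag(tokens: Sequence[str], dim: int = 64) -> Vector:
    """Toy embedding: count tokens in hashed buckets, then L2-normalise."""
    vec = [0.0] * dim
    for tok in tokens:
        bucket = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return _normalise(vec)


def word_embedder(text: str) -> Vector:
    # Stand-in for one embedding model (assumption): bag of lowercase words.
    return _hashed_bag(text.lower().split())


def trigram_embedder(text: str) -> Vector:
    # Stand-in for a second embedding model (assumption): character trigrams.
    s = text.lower()
    return _hashed_bag([s[i:i + 3] for i in range(max(len(s) - 2, 1))])


def ensemble_embed(text: str,
                   embedders: Sequence[Callable[[str], Vector]],
                   weights: Sequence[float]) -> Vector:
    """Weighted concatenation of the base embeddings.

    The paper trains a meta-encoder for this step; fixed weights are used
    here only to keep the sketch self-contained.
    """
    combined: Vector = []
    for embed, weight in zip(embedders, weights):
        combined.extend(weight * x for x in embed(text))
    return _normalise(combined)


def cosine(a: Vector, b: Vector) -> float:
    # Inputs are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold  # hypothetical value, not from the paper
        self.entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the best-matching stored response, or None on a cache miss."""
        q = self.embed(query)
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


def embed(text: str) -> Vector:
    return ensemble_embed(text, [word_embedder, trigram_embedder], [0.5, 0.5])


cache = SemanticCache(embed)
cache.put("How do I learn Python quickly?", "cached answer")
# An identical query has similarity 1.0 (up to rounding), so this is a hit.
print(cache.get("How do I learn Python quickly?"))
```

On a miss, a real system would call the LLM and `put` the new response. The linear scan over `entries` is only suitable for a toy cache; a production setup would typically use a vector index instead.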
Taxonomy
Topics: Caching and Content Delivery · Information Retrieval and Search Behavior · Topic Modeling
