An Ensemble Embedding Approach for Improving Semantic Caching Performance in LLM-based Systems

Shervin Ghaffari, Zohre Bahranifard, and Mohammad Akbari

arXiv:2507.07061 · cs.LG · July 10, 2025

Open Access

TL;DR

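This paper introduces an ensemble embedding method that combines multiple embedding models through a trained meta-encoder to improve semantic caching in LLM systems. On the Quora Question Pairs dataset it reaches a 92% cache hit ratio for semantically equivalent queries while rejecting non-equivalent queries with 85% accuracy.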

Contribution

The paper proposes an ensemble embedding approach in which a trained meta-encoder combines several embedding models to better capture semantic similarity between queries in LLM caching. It outperforms single-model methods.

Findings

01. Achieved a 92% cache hit ratio for semantically equivalent queries.

02. Maintained 85% accuracy in rejecting non-equivalent queries.

03. Significantly outperformed single-model approaches in semantic distinction.
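
The page does not give formulas for the two headline numbers. The sketch below shows one plausible reading, assuming each evaluated query pair carries a ground-truth equivalence label (as QQP pairs do) and a record of whether the cache served a stored response. The names `PairOutcome`, `hit_ratio_on_equivalent`, and `rejection_accuracy` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairOutcome:
    """One evaluated query pair (hypothetical record layout)."""
    is_equivalent: bool  # ground-truth label for the pair
    cache_hit: bool      # True if the cache served a stored response


def hit_ratio_on_equivalent(outcomes: List[PairOutcome]) -> float:
    """Fraction of truly equivalent pairs that produced a cache hit."""
    equivalent = [o for o in outcomes if o.is_equivalent]
    if not equivalent:
        return 0.0
    return sum(o.cache_hit for o in equivalent) / len(equivalent)


def rejection_accuracy(outcomes: List[PairOutcome]) -> float:
    """Fraction of non-equivalent pairs that were correctly NOT served from cache."""
    different = [o for o in outcomes if not o.is_equivalent]
    if not different:
        return 0.0
    return sum(not o.cache_hit for o in different) / len(different)


# Toy data for illustration only; these are not results from the paper.
sample = [
    PairOutcome(True, True),
    PairOutcome(True, True),
    PairOutcome(True, True),
    PairOutcome(True, False),
    PairOutcome(False, False),
    PairOutcome(False, True),
]
hit_ratio = hit_ratio_on_equivalent(sample)
reject_acc = rejection_accuracy(sample)
```

Reporting the two numbers separately matters because a cache can raise its hit ratio simply by lowering its similarity threshold, at the cost of serving wrong answers for non-equivalent queries.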

Abstract

Semantic caching enhances the efficiency of large language model (LLM) systems by identifying semantically similar queries, storing responses once, and serving them for subsequent equivalent requests. However, existing semantic caching frameworks rely on single embedding models for query representation, which limits their ability to capture the diverse semantic relationships present in real-world query distributions. This paper presents an ensemble embedding approach that combines multiple embedding models through a trained meta-encoder to improve semantic similarity detection in LLM caching systems. We evaluate our method using the Quora Question Pairs (QQP) dataset, measuring cache hit ratios, cache miss ratios, token savings, and response times. Our ensemble approach achieves a 92% cache hit ratio for semantically equivalent queries while maintaining an 85% accuracy in correctly…
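
The general shape of a semantic cache built on an ensemble of embeddings can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two toy embedders (hashed word counts and hashed character trigrams) stand in for real embedding models, and a fixed weighted concatenation stands in for the trained meta-encoder described in the abstract. The dimension, weights, and similarity threshold are arbitrary placeholder values.

```python
import hashlib
import math
from typing import List, Optional, Tuple

DIM = 64  # assumed size of each toy embedding


def _normalize(vec: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def _hashed_bag(tokens: List[str], dim: int = DIM) -> List[float]:
    """Map tokens to a fixed-size, L2-normalized count vector via hashing."""
    vec = [0.0] * dim
    for tok in tokens:
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return _normalize(vec)


def embed_words(text: str) -> List[float]:
    """Toy stand-in for embedding model 1 (word level)."""
    return _hashed_bag(text.lower().split())


def embed_char_trigrams(text: str) -> List[float]:
    """Toy stand-in for embedding model 2 (character level)."""
    s = text.lower()
    return _hashed_bag([s[i:i + 3] for i in range(len(s) - 2)])


def ensemble_embed(text: str, weights: Tuple[float, float] = (0.5, 0.5)) -> List[float]:
    """Weighted concatenation; a placeholder for a trained meta-encoder."""
    parts: List[float] = []
    for w, embed in zip(weights, (embed_words, embed_char_trigrams)):
        parts.extend(w * x for x in embed(text))
    return _normalize(parts)


def cosine(a: List[float], b: List[float]) -> float:
    # Inputs are already L2-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed value, would be tuned in practice
        self.entries: List[Tuple[List[float], str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((ensemble_embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the best stored response if it clears the threshold, else None."""
        q = ensemble_embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
miss_before = cache.get("How do I learn Python?")
cache.put("How do I learn Python?", "stored LLM answer")
hit_after = cache.get("how do i learn python?")
```

On a miss, the caller would query the LLM and call `put` with the new response. In a real system the linear scan over `entries` would typically be replaced by a vector index, and the combination step would be learned from labeled pairs rather than fixed.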

Taxonomy

Topics: Caching and Content Delivery · Information Retrieval and Search Behavior · Topic Modeling