Marconi: Prefix Caching for the Era of Hybrid LLMs
Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali

TL;DR
Marconi introduces a novel prefix caching system tailored for hybrid large language models, significantly improving cache hit rates and reducing latency by intelligently managing cache entries based on reuse likelihood and compute savings.
Contribution
This work presents the first system supporting efficient prefix caching for hybrid LLMs, with new admission and eviction policies that enhance cache efficiency.
Findings
Up to 34.4× higher token hit rates
Up to 71.1% (617 ms) lower time-to-first-token (TTFT)
Abstract
Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously…
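The exact-match constraint described in the abstract can be illustrated with a minimal sketch. This is not Marconi's implementation. It assumes a toy cache in which attention KV entries can be truncated to any shared prefix, while a recurrent state is reusable only at a token position where it was explicitly saved. The function names and the `ssm_checkpoints` parameter are hypothetical.

```python
def shared_prefix_len(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def attention_reuse_len(cached_tokens, request_tokens):
    """KV caches hold one entry per token, so a cached sequence can be
    truncated to whatever prefix it shares with the new request."""
    return shared_prefix_len(cached_tokens, request_tokens)


def hybrid_reuse_len(cached_tokens, ssm_checkpoints, request_tokens):
    """Recurrent states are updated in place and cannot be rolled back,
    so reuse is capped at the longest saved position that lies within
    the shared prefix.

    ssm_checkpoints: set of token positions at which the recurrent
    state was saved (an assumption made for this sketch).
    """
    shared = shared_prefix_len(cached_tokens, request_tokens)
    valid = [p for p in ssm_checkpoints if p <= shared]
    return max(valid, default=0)


cached = [1, 2, 3, 4, 5, 6]
request = [1, 2, 3, 4, 9, 9]

# Attention-only model: the 4 shared tokens are reusable.
attn = attention_reuse_len(cached, request)

# Hybrid model with a state saved only at the end of the cached
# sequence: nothing is reusable, because position 6 is past the overlap.
hybrid_end_only = hybrid_reuse_len(cached, {6}, request)

# Saving an extra state at position 2 recovers part of the overlap,
# at the cost of one more (large) cache entry.
hybrid_with_extra = hybrid_reuse_len(cached, {2, 6}, request)
```

In this toy setup, covering every possible overlap would require saving a state at every position, which is the "deluge of (large) cache entries" the abstract refers to; choosing which states to admit and evict is the problem the paper's policies target.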
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Library Science and Information Systems
Methods: Softmax · Attention Is All You Need
