Marconi: Prefix Caching for the Era of Hybrid LLMs

Rui Pan; Zhuang Wang; Zhen Jia; Can Karakus; Luca Zancato; Tri Dao; Yida Wang; Ravi Netravali

arXiv:2411.19379·cs.DC·April 11, 2025

TL;DR

Marconi introduces a novel prefix caching system tailored for hybrid large language models, significantly improving cache hit rates and reducing latency by intelligently managing cache entries based on reuse likelihood and compute savings.

Contribution

This work presents the first system supporting efficient prefix caching for hybrid LLMs, with new admission and eviction policies that enhance cache efficiency.
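As a rough illustration of what an eviction policy that weighs reuse likelihood against compute savings might look like, the sketch below scores each cache entry by normalized recency plus a weighted compute-saved-per-byte term and evicts the lowest scorer. This is not Marconi's actual policy: the `Entry` fields, the `alpha` weight, and the min-max normalization are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    key: str
    last_used: float    # time of the most recent hit (hypothetical units)
    flops_saved: float  # compute skipped when this entry is hit (hypothetical)
    size_bytes: int     # memory the entry occupies


def _normalize(values: List[float]) -> List[float]:
    """Min-max scale to [0, 1]; all-equal inputs map to 0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def pick_victim(entries: List[Entry], now: float, alpha: float = 1.0) -> Entry:
    """Return the entry with the lowest combined score.

    alpha = 0 reduces to plain LRU; larger alpha favors keeping
    entries that save more compute per byte of cache they occupy.
    """
    recency = _normalize([-(now - e.last_used) for e in entries])
    efficiency = _normalize([e.flops_saved / e.size_bytes for e in entries])
    scores = [r + alpha * f for r, f in zip(recency, efficiency)]
    return entries[scores.index(min(scores))]


entries = [
    Entry("a", last_used=10.0, flops_saved=100.0, size_bytes=100),
    Entry("b", last_used=9.0, flops_saved=1000.0, size_bytes=100),
    Entry("c", last_used=1.0, flops_saved=200.0, size_bytes=100),
]
lru_victim = pick_victim(entries, now=11.0, alpha=0.0)
weighted_victim = pick_victim(entries, now=11.0, alpha=10.0)
```

With `alpha=0.0` the least recently used entry is chosen; with a large `alpha` the most recently used entry can be evicted instead if it saves the least compute for the space it takes.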

Findings

01 Up to 34.4× higher token hit rates

02 Up to 71.1% lower time-to-first-token (TTFT)

03 Up to 617 ms reduction in time-to-first-token

Abstract

Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously…
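The exact-match constraint described in the abstract can be made concrete with a small sketch. It contrasts an attention-style cache, where a cached sequence can be truncated to whatever prefix it shares with the request, against a recurrent-state checkpoint, which is reusable only if the full token sequence it was saved after is a prefix of the request. The function names and the tuple-of-token-IDs representation are assumptions for illustration, not taken from the paper.

```python
from typing import Iterable, Tuple

Tokens = Tuple[int, ...]


def kv_longest_prefix_hit(cached: Iterable[Tokens], request: Tokens) -> int:
    """Attention KV entries can be truncated, so any shared prefix is reusable."""
    best = 0
    for seq in cached:
        shared = 0
        for a, b in zip(seq, request):
            if a != b:
                break
            shared += 1
        best = max(best, shared)
    return best


def recurrent_exact_hit(checkpoints: Iterable[Tokens], request: Tokens) -> int:
    """A recurrent state saved after k tokens cannot be rolled back:
    it helps only if those exact k tokens are a prefix of the request."""
    best = 0
    for seq in checkpoints:
        k = len(seq)
        if k <= len(request) and request[:k] == seq:
            best = max(best, k)
    return best


request = (1, 2, 3, 9)
kv_hit = kv_longest_prefix_hit([(1, 2, 3, 4, 5)], request)
ssm_miss = recurrent_exact_hit([(1, 2, 3, 4, 5)], request)
ssm_hit = recurrent_exact_hit([(1, 2, 3, 4, 5), (1, 2, 3)], request)
```

Here the attention-style lookup reuses three tokens from the longer cached sequence, while the recurrent lookup gets nothing unless a checkpoint was saved at exactly the three-token boundary. Saving a checkpoint at every possible boundary is what produces the many large, rarely reused entries the abstract mentions.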

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Library Science and Information Systems

Methods: Softmax · Attention Is All You Need