DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

Yuhan Liu; Yuyang Huang; Jiayi Yao; Shaoting Feng; Zhuohan Gu; Kuntai Du; Hanchen Li; Yihua Cheng; Junchen Jiang; Shan Lu; Madan Musuvathi; Esha Choukse

arXiv:2411.02820 · cs.MA · July 16, 2025 · 2 cites

TL;DR

DroidSpeak is a distributed LLM inference system that lets different large language models with the same architecture share key-value (KV) caches, improving inference throughput by up to 4x with negligible quality loss.

Contribution

DroidSpeak introduces the first method for sharing KV caches across different models in distributed LLM inference, including a selective recomputation approach that maintains output quality.
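The selective recomputation idea can be illustrated with a minimal sketch. This is not the authors' implementation: the names `assemble_kv_cache`, `reuse_fraction`, and `recompute_fn` are hypothetical, each layer's KV entries are represented as a plain list of floats, and the sketch ignores how the receiving model obtains the inputs it needs to recompute a layer. It only shows the bookkeeping: some layers are taken from another model's cache, the rest are recomputed.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical stand-in: one layer's KV entries are a plain list of floats.
LayerKV = List[float]


def assemble_kv_cache(
    donor_kv: Sequence[LayerKV],
    recompute_layers: Set[int],
    recompute_fn: Callable[[int], LayerKV],
) -> List[LayerKV]:
    """Reuse the donor model's per-layer KV entries, except for the layers
    listed in recompute_layers, which the receiving model recomputes."""
    out_of_range = [i for i in recompute_layers if not 0 <= i < len(donor_kv)]
    if out_of_range:
        raise ValueError(f"layer indices out of range: {sorted(out_of_range)}")

    merged: List[LayerKV] = []
    for layer, kv in enumerate(donor_kv):
        if layer in recompute_layers:
            # Assumption: recompute_fn returns the receiver's own KV for this layer.
            merged.append(recompute_fn(layer))
        else:
            # Reused as-is from the donor model (copied to avoid aliasing).
            merged.append(list(kv))
    return merged


def reuse_fraction(num_layers: int, recompute_layers: Set[int]) -> float:
    """Fraction of layers whose KV entries are reused rather than recomputed."""
    return 1.0 - len(recompute_layers) / num_layers


if __name__ == "__main__":
    donor = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    merged = assemble_kv_cache(donor, {0, 1}, lambda layer: [float(layer)] * 2)
    print(merged)
    print(reuse_fraction(len(donor), {0, 1}))
```

Which layers to recompute is the substantive question the paper studies; in this sketch the set is simply passed in by the caller.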

Findings

01. Up to 4x throughput improvement
02. Approximately 3.1x faster prefill time
03. Negligible quality loss in various metrics

Abstract

Compound AI systems, such as agentic systems, are an emerging trend in large-scale enterprise settings, with multiple LLMs specialized for different users, tasks, and/or roles working together. In these scenarios, different models often process inputs that share the same context prefix. Although much work was done in the past to enable the reuse of prefix KV caches across inputs for a single model, how to enable one model to reuse the prefix KV caches of a different model remains an open question. We introduce DroidSpeak, the first distributed LLM inference system that enables KV cache reuse across distributed nodes running inference of different LLMs, so long as the LLMs have the same architecture. We present the first study that aims at understanding the impact of sharing KV caches across different LLMs, and if/when such sharing affects quality. Inspired by the findings, we present…
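The setting in the abstract, where different models receive inputs with a common context prefix and may share caches only if their architectures match, can be sketched as a lookup table. The class `SharedPrefixStore`, its methods, and the architecture and model labels below are hypothetical illustrations, not part of DroidSpeak; the cached KV payload is an opaque object, and the linear scan over prefix lengths is chosen for clarity rather than efficiency.

```python
from typing import Dict, Optional, Sequence, Tuple

# Key: (architecture label, token prefix). Value: (producer model, KV payload).
_Key = Tuple[str, Tuple[int, ...]]
_Value = Tuple[str, object]


class SharedPrefixStore:
    """Hypothetical store of prefix KV caches, shared by models that
    declare the same architecture label."""

    def __init__(self) -> None:
        self._entries: Dict[_Key, _Value] = {}

    def put(self, architecture: str, producer_model: str,
            tokens: Sequence[int], kv: object) -> None:
        """Record the KV payload that producer_model computed for tokens."""
        self._entries[(architecture, tuple(tokens))] = (producer_model, kv)

    def longest_prefix(
        self, architecture: str, tokens: Sequence[int]
    ) -> Optional[Tuple[int, str, object]]:
        """Return (prefix_length, producer_model, kv) for the longest cached
        prefix of tokens under this architecture, or None if there is none."""
        seq = tuple(tokens)
        for n in range(len(seq), 0, -1):
            hit = self._entries.get((architecture, seq[:n]))
            if hit is not None:
                return n, hit[0], hit[1]
        return None


if __name__ == "__main__":
    store = SharedPrefixStore()
    store.put("arch-x", "model-a", [1, 2, 3], "kv-from-model-a")
    # A different model with the same architecture label can find the entry.
    print(store.longest_prefix("arch-x", [1, 2, 3, 4, 5]))
    # A model with a different architecture label cannot.
    print(store.longest_prefix("arch-y", [1, 2, 3, 4, 5]))
```

A hit only says that a cache from another model exists for the prefix; whether it can be used unchanged without hurting quality is the question the paper investigates.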

Taxonomy

Topics: Service-Oriented Architecture and Web Services · Network Security and Intrusion Detection