PiKV: KV Cache Management System for Mixture of Experts

Dong Liu; Yanxuan Yu; Ben Lengerich; Ying Nian Wu

arXiv:2508.06526 · cs.DC · May 20, 2026

TL;DR

PiKV is an open-source distributed KV cache system designed to optimize memory and communication efficiency for Mixture of Experts models in large-scale language model inference.

Contribution

PiKV introduces expert-sharded storage, routing, scheduling, and compression techniques to improve KV cache management in MoE architectures.

Findings

01 Reduces memory usage through compression modules.

02 Improves cache access efficiency with expert sharding and routing.

03 An open-source implementation is available on GitHub.

Abstract

As large language models continue to grow in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While Mixture of Experts (MoE) architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce PiKV, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV uses expert-sharded KV storage to partition caches across GPUs, PiKV routing to reduce token-to-KV access, and PiKV scheduling to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates PiKV compression modules into the caching pipeline for acceleration. PiKV is publicly available as an open-source…
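To make the ideas of expert-sharded storage and query-relevant retention more concrete, here is a minimal Python sketch. It is not PiKV's implementation or API: the class name `ExpertShardedKVCache`, the one-shard-per-expert placement rule, and the per-entry relevance `score` are all assumptions made for illustration. Each expert gets its own bounded shard (standing in for a GPU-local cache), and when a shard overflows, the entry with the lowest score is dropped.

```python
class ExpertShardedKVCache:
    """Toy KV cache with one bounded shard per expert (illustrative only)."""

    def __init__(self, num_experts, capacity_per_shard):
        self.num_experts = num_experts
        self.capacity = capacity_per_shard
        # One dict per shard, mapping token_id -> (key, value, score).
        self.shards = [dict() for _ in range(num_experts)]

    def shard_for(self, expert_id):
        # Hypothetical placement rule: expert i is stored on shard i.
        return expert_id % self.num_experts

    def put(self, expert_id, token_id, key, value, score):
        shard = self.shards[self.shard_for(expert_id)]
        shard[token_id] = (key, value, score)
        if len(shard) > self.capacity:
            # Assumed retention policy: evict the lowest-scoring entry.
            victim = min(shard, key=lambda t: shard[t][2])
            del shard[victim]

    def get(self, expert_id, token_id):
        # Only the shard owned by the routed expert is consulted.
        entry = self.shards[self.shard_for(expert_id)].get(token_id)
        return None if entry is None else entry[:2]

    def shard_sizes(self):
        return [len(shard) for shard in self.shards]


# Usage: two experts, at most two cached entries per shard.
cache = ExpertShardedKVCache(num_experts=2, capacity_per_shard=2)
cache.put(0, 10, "k10", "v10", score=0.9)
cache.put(0, 11, "k11", "v11", score=0.1)
cache.put(0, 12, "k12", "v12", score=0.5)  # shard 0 overflows; token 11 is evicted
cache.put(1, 10, "k10b", "v10b", score=0.3)
print(cache.get(0, 11))     # None
print(cache.shard_sizes())  # [2, 1]
```

Because a lookup touches only the shard selected by the routed expert, each shard's memory is bounded by `capacity_per_shard` rather than by the full sequence length. How a real system would compute relevance scores and place shards on devices is outside what this sketch shows.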

Code & Models

Repositories

NoakLiu/PiKV
github

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Caching and Content Delivery