MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
Wenxuan Zeng, Ye Dong, Jinjin Zhou, Jin Tan, Lei Wang, Tao Wei, Runsheng Wang, Meng Li

TL;DR
MPCache introduces an MPC-friendly KV cache eviction framework that reduces latency and communication overhead in private LLM inference by selectively discarding unimportant cache entries and activating only relevant ones.
Contribution
The paper proposes a novel MPC-compatible KV cache eviction method that combines static and dynamic algorithms with multiple optimizations, improving the efficiency of private LLM inference.
Findings
Achieves a 1.8x to 2.01x latency reduction
Achieves a 3.39x to 8.37x communication reduction
Outperforms prior cache eviction baselines across tasks
Abstract
Private large language model (LLM) inference based on secure multi-party computation (MPC) achieves formal data privacy protection but suffers from significant latency overhead, especially for long input sequences. While key-value (KV) cache eviction and sparse attention algorithms have been proposed for efficient LLM inference in plaintext, they are not designed for MPC and cannot benefit private LLM inference directly. In this paper, we propose an accurate and MPC-friendly KV cache eviction framework, dubbed MPCache, building on the observation that historical tokens in a long sequence may have different effects on the downstream decoding. Hence, MPCache combines a look-once static eviction algorithm to discard unimportant KV cache and a query-aware dynamic selection algorithm to activate only a small subset of KV cache for attention computation. MPCache further incorporates a series…
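The two-stage idea described in the abstract (a one-time static eviction followed by a per-query dynamic selection) can be illustrated with a minimal plaintext sketch. This is not the authors' implementation and contains no MPC protocol: the function names `static_evict` and `dynamic_select`, the per-token `importance` scores, the `keep_ratio` and `top_k` parameters, and the token-level top-k scoring are all assumptions chosen for illustration.

```python
import numpy as np

def static_evict(keys, values, importance, keep_ratio):
    """One-time eviction: keep the top `keep_ratio` fraction of tokens.

    `importance` is a hypothetical per-token score; how it is computed
    is an assumption and is left to the caller.
    """
    n = keys.shape[0]
    k = max(1, int(n * keep_ratio))
    # Indices of the k highest-scoring tokens, restored to sequence order.
    keep = np.sort(np.argsort(importance)[-k:])
    return keys[keep], values[keep], keep

def dynamic_select(query, keys, values, top_k):
    """Per-query selection: attend only over the `top_k` best-matching keys."""
    scores = keys @ query / np.sqrt(keys.shape[1])
    top_k = min(top_k, keys.shape[0])
    idx = np.sort(np.argsort(scores)[-top_k:])
    # Numerically stable softmax over the selected scores only.
    s = scores[idx]
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ values[idx], idx

# Toy data: 16 cached tokens with head dimension 8 (arbitrary sizes).
rng = np.random.default_rng(0)
n, d = 16, 8
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
importance = rng.random(n)  # placeholder scores, not from the paper

K_s, V_s, kept = static_evict(K, V, importance, keep_ratio=0.5)
out, used = dynamic_select(rng.normal(size=d), K_s, V_s, top_k=4)
```

In this sketch the static step halves the cache once, and each decoding query then touches only four of the remaining entries. In an MPC setting the comparison and top-k steps are themselves costly, which is presumably why the paper emphasizes an MPC-friendly design; that cost is not modeled here.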
Taxonomy
Topics: Cryptography and Data Security · Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques
