Swarm: Co-Activation Aware KVCache Offloading Across Multiple SSDs
Tuowei Wang, Liyun Chu, Ruwen Fan, Ju Ren

TL;DR
Swarm exploits stable co-activation patterns among KVCache entries in large language model (LLM) inference to offload cache data across multiple SSDs, reducing I/O time by 2.41x and improving bandwidth utilization by 2.72x.
Contribution
The paper introduces the concept of KVCache Co-Activation and presents Swarm, a multi-SSD offloading system that improves parallel I/O and bandwidth utilization for LLM inference workloads.
Findings
Reduces I/O time by 2.41x
Improves bandwidth utilization by 2.72x
Effectively adapts to evolving access patterns
Abstract
The key-value (KV) cache has become the dominant contributor to memory consumption in large language model (LLM) inference. Although offloading KVCache from GPU high-bandwidth memory (HBM) to CPU DRAM alleviates device memory pressure, DRAM remains capacity-limited and costly for large, persistent workloads. Solid-state drives (SSDs) provide a cost-effective alternative, but naive SSD-based paging is fundamentally bandwidth-bound due to limited PCIe throughput and per-device bandwidth constraints. In this paper, we observe that KVCache activations in real-world workloads exhibit strong and stable correlations. We term this phenomenon KVCache Co-Activation, where accessing a KV entry is often accompanied by a stable and recurring set of other KV entries. Leveraging this property, we present Swarm, an SSD-based KVCache offloading system that converts bandwidth-bound single-device access…
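The co-activation idea described in the abstract can be illustrated with a toy sketch. This is not Swarm's algorithm, since the abstract above is truncated before it describes the design. It is a minimal illustration under assumed inputs: a hypothetical access trace given as a list of KV entry IDs per decoding step, and a simple greedy placement heuristic chosen for this example. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_activation_counts(trace):
    """Count how often each pair of KV entries is accessed in the same step."""
    counts = Counter()
    for step in trace:
        for a, b in combinations(sorted(set(step)), 2):
            counts[(a, b)] += 1
    return counts


def place_on_ssds(trace, num_ssds):
    """Greedy heuristic (an assumption, not the paper's method): put each
    entry on the SSD that holds the least co-activation weight with it,
    so entries that are read together tend to sit on different devices."""
    partners = defaultdict(Counter)
    for (a, b), c in co_activation_counts(trace).items():
        partners[a][b] = c
        partners[b][a] = c

    # Visit the most strongly co-activated entries first; break ties by ID.
    entries = sorted(
        {e for step in trace for e in step},
        key=lambda e: (-sum(partners[e].values()), e),
    )

    placement = {}
    load = [0] * num_ssds
    for e in entries:
        # Cost of an SSD = co-activation weight with entries already on it.
        cost = [0] * num_ssds
        for p, c in partners[e].items():
            if p in placement:
                cost[placement[p]] += c
        ssd = min(range(num_ssds), key=lambda s: (cost[s], load[s], s))
        placement[e] = ssd
        load[ssd] += 1
    return placement


def max_reads_per_ssd(step, placement, num_ssds):
    """Largest number of reads any single SSD must serve for one step."""
    per_ssd = [0] * num_ssds
    for e in set(step):
        per_ssd[placement[e]] += 1
    return max(per_ssd)


if __name__ == "__main__":
    # Hypothetical trace: entries 0 and 4 recur together, as do 1 and 5.
    trace = [[0, 4], [0, 4], [1, 5], [1, 5]]
    placement = place_on_ssds(trace, num_ssds=2)
    for step in trace:
        print(step, "->", max_reads_per_ssd(step, placement, 2), "read(s) on the busiest SSD")
```

In this toy trace, each co-activated pair ends up split across the two devices, so a step's reads can be issued in parallel rather than queued on a single SSD. A real system would also need to handle entry sizes, device capacity, and access patterns that change over time.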
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
