KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference
Huawei Zhang, Chunwei Xia, Zheng Wang

TL;DR
KVSwap is a novel software framework that enables efficient long-context inference on local devices by offloading key-value cache data to disk, overcoming memory limitations while maintaining performance.
Contribution
KVSwap introduces a disk-aware KV cache offloading method tailored for embedded and mobile devices, improving memory efficiency and throughput during long-context inference.
Findings
Higher throughput under tight memory budgets
Maintains generation quality compared to existing schemes
Effective utilization of disk storage for KV cache
Abstract
Language models (LMs) underpin emerging mobile and embedded AI applications like meeting and video summarization and document analysis, which often require processing multiple long-context inputs. Running an LM locally on-device improves privacy, enables offline use, and reduces cost, but long-context inference quickly hits a memory capacity wall as the key-value (KV) cache grows linearly with context length and batch size. Existing KV-cache offloading schemes are designed to transfer cache data from GPU memory to CPU memory; however, they are not suitable for embedded and mobile systems, where the CPU and GPU (or NPU) typically share a unified memory and the non-volatile secondary storage (disk) offers limited I/O bandwidth. We present KVSwap, a software framework tailored for local devices that achieves high memory efficiency while effectively leveraging disk storage. KVSwap…
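To make the linear growth of the KV cache concrete, here is a minimal back-of-the-envelope sketch. It uses the standard size estimate for a decoder-only transformer (one key and one value tensor per layer) and is not taken from the paper. The function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 counts one key tensor plus one value tensor
    per layer. bytes_per_elem=2 assumes 16-bit (fp16/bf16) storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


# Hypothetical model shape, for illustration only (not from the paper).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for context_len in (4096, 32768):
    for batch_size in (1, 4):
        size = kv_cache_bytes(context_len=context_len,
                              batch_size=batch_size, **cfg)
        print(f"context={context_len:>6} batch={batch_size} "
              f"-> {size / 2**30:.2f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so a single 32,768-token sequence already needs about 4 GiB, and a batch of four needs about 16 GiB, which is more than many mobile and embedded devices have in unified memory.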
Taxonomy
Topics: Big Data and Digital Economy · Advanced Data Storage Technologies · IoT and Edge/Fog Computing
