KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference

Huawei Zhang; Chunwei Xia; Zheng Wang

arXiv:2511.11907 · cs.DC · December 15, 2025

TL;DR

KVSwap is a software framework for efficient long-context language-model inference on local devices. It offloads key-value (KV) cache data to disk, working around device memory limits while sustaining throughput and generation quality.

Contribution

KVSwap introduces a disk-aware KV cache offloading method tailored for embedded and mobile devices, improving memory efficiency and throughput during long-context inference.

Findings

01. Higher throughput under tight memory budgets
02. Maintains generation quality compared to existing schemes
03. Effective utilization of disk storage for KV cache

Abstract

Language models (LMs) underpin emerging mobile and embedded AI applications like meeting and video summarization and document analysis, which often require processing multiple long-context inputs. Running an LM locally on-device improves privacy, enables offline use, and reduces cost, but long-context inference quickly hits a "memory capacity wall" as the key-value (KV) cache grows linearly with context length and batch size. Existing KV-cache offloading schemes are designed to transfer cache data from GPU memory to CPU memory; however, they are not suitable for embedded and mobile systems, where the CPU and GPU (or NPU) typically share a unified memory and the non-volatile secondary storage (disk) offers limited I/O bandwidth. We present KVSwap, a software framework tailored for local devices that achieves high memory efficiency while effectively leveraging disk storage. KVSwap…
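The linear growth mentioned in the abstract follows from the standard KV-cache size formula: every layer stores one key vector and one value vector per KV head for every token of every sequence in the batch. The minimal Python sketch below makes this concrete. It assumes a hypothetical decoder shape (32 layers, 8 KV heads, head dimension 128, fp16 storage) and a hypothetical 1 GiB/s disk read bandwidth; none of these numbers come from the paper. The helper names `kv_cache_bytes` and `full_read_seconds` are illustrative, and the read-time helper only estimates the cost of streaming the entire cache from disk once. It does not model KVSwap's actual I/O scheme.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, KV head, and token.

    The leading factor 2 counts one key tensor and one value tensor per layer.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


def full_read_seconds(cache_bytes, disk_bandwidth_bytes_per_s):
    """Time to stream the whole cache from disk once at a fixed read bandwidth."""
    return cache_bytes / disk_bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical model shape and disk bandwidth, chosen only for illustration.
    for ctx in (4096, 32768):
        size = kv_cache_bytes(32, 8, 128, ctx, batch_size=4)
        print(ctx, size / GIB, full_read_seconds(size, 1 * GIB))
    # prints:
    # 4096 2.0 2.0
    # 32768 16.0 16.0
```

With these assumed numbers, an 8x longer context yields an 8x larger cache (2 GiB to 16 GiB for a batch of 4), and reading it back in full at 1 GiB/s takes proportionally longer. This illustrates why limited disk bandwidth, rather than capacity alone, constrains disk-based offloading.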

Taxonomy

Topics: Big Data and Digital Economy · Advanced Data Storage Technologies · IoT and Edge/Fog Computing