DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Bodon Jeong; Hongsu Byun; Youngjae Kim; Weikuan Yu; Kyungkeun Lee; Jihoon Yang; Sungyong Park

arXiv:2604.26557 · cs.DC · April 30, 2026

TL;DR

DUAL-BLADE is a dual-path KV-cache offloading framework for edge LLM inference. At runtime it places each KV tensor on either a page-cache path or an NVMe-direct path according to available memory, which reduces prefill and decode latency and improves SSD utilization.

Contribution

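DUAL-BLADE introduces a runtime-adaptive framework that assigns KV tensors to either a page-cache path or an NVMe-direct path. The NVMe-direct path bypasses the filesystem for low-overhead storage access, and adaptive pipeline parallelism overlaps storage I/O with GPU processing.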
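The path assignment described above can be illustrated with a minimal sketch. This is not the authors' policy, which this summary does not spell out. It assumes a simple headroom rule: a tensor takes the page-cache path only if it fits in the free memory left after a reserve. The names `Path`, `choose_path`, `free_bytes`, and `reserve_bytes` are hypothetical.

```python
from enum import Enum


class Path(Enum):
    PAGE_CACHE = "page_cache"    # file-backed path served through the kernel page cache
    NVME_DIRECT = "nvme_direct"  # path that bypasses the filesystem


def choose_path(tensor_bytes: int, free_bytes: int, reserve_bytes: int = 0) -> Path:
    """Pick a residency path for one KV tensor.

    Assumed rule (hypothetical, for illustration only): use the page-cache
    path while the tensor fits in the memory headroom, otherwise fall back
    to the NVMe-direct path.
    """
    headroom = free_bytes - reserve_bytes
    if tensor_bytes <= headroom:
        return Path.PAGE_CACHE
    return Path.NVME_DIRECT
```

A real system would sample memory availability continuously and might move tensors between paths over time; the sketch only shows a single decision.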

Findings

01 Reduces prefill latency by up to 33.1%

02 Decreases decode latency by up to 42.4%

03 Improves SSD utilization by 2.2x

Abstract

The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalable capacity, existing file-based designs rely heavily on the kernel page cache, leading to cache thrashing, unpredictable latency, and high software overhead under memory pressure. We present DUAL-BLADE, a dual-path KV residency framework that dynamically assigns KV tensors to either a page-cache path or an NVMe-direct path based on runtime memory availability. The NVMe-direct path bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, enabling low-overhead direct storage access. DUAL-BLADE further incorporates adaptive pipeline parallelism to overlap storage I/O with…
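To make the idea of mapping KV tensors to contiguous LBA regions concrete, here is a small sketch of a bump allocator that hands each tensor one contiguous extent of logical blocks. It is an illustration under stated assumptions, not the paper's design: the 4096-byte block size, the `Extent` and `LbaAllocator` names, and the append-only allocation strategy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    start_lba: int   # first logical block of the region
    num_blocks: int  # length of the region in blocks


class LbaAllocator:
    """Append-only allocator over a fixed LBA range (hypothetical sketch)."""

    def __init__(self, base_lba: int, total_blocks: int, block_size: int = 4096):
        self.block_size = block_size
        self._next = base_lba
        self._end = base_lba + total_blocks
        self._extents: dict[str, Extent] = {}

    def allocate(self, key: str, nbytes: int) -> Extent:
        """Reserve one contiguous extent large enough for nbytes."""
        blocks = -(-nbytes // self.block_size)  # ceiling division
        if self._next + blocks > self._end:
            raise MemoryError("LBA range exhausted")
        extent = Extent(self._next, blocks)
        self._extents[key] = extent
        self._next += blocks
        return extent

    def byte_offset(self, key: str) -> int:
        """Device byte offset at which the tensor's extent begins."""
        return self._extents[key].start_lba * self.block_size
```

Because each tensor occupies a single extent, one offset and one length are enough to describe a read or write, which is what makes direct access without filesystem metadata lookups plausible.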
