SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

Hongyao Liu; Liuqun Zhai; Junyi Wang; Zhengru Fang

arXiv:2604.21231 · cs.NI · May 6, 2026

TL;DR

SparKV is an adaptive framework for loading KV caches during on-device LLM inference. It decides, chunk by chunk, whether to stream a KV chunk from the cloud or compute it locally, and overlaps the two paths to reduce latency and energy use.

Contribution

SparKV introduces a cost-aware KV loading method whose offline-generated schedules are refined at runtime, improving on-device LLM inference efficiency under variable network and resource conditions.

Findings

01. Reduces Time-to-First-Token by up to 5.1x

02. Lowers per-request energy consumption by up to 3.3x

03. Has negligible impact on response quality

Abstract

Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x-5.1x with negligible impact on response quality, while lowering…
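The per-chunk stream-or-compute decision described in the abstract can be illustrated with a small sketch. This is not SparKV's scheduler, which the abstract does not specify. It is a simple greedy heuristic under stated assumptions: each chunk has a known download size and a known local recompute time, the network path and the compute path run fully in parallel, and chunks are treated as independent (real prefill has dependencies between chunks, which are ignored here). The names `Chunk`, `plan`, and `bandwidth_bps` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    size_bytes: int    # assumed size of the precomputed KV chunk to download
    compute_s: float   # assumed time to recompute the chunk on-device


def plan(chunks, bandwidth_bps, stream_busy=0.0, compute_busy=0.0):
    """Greedy illustration: send each chunk to whichever path finishes it first.

    stream_busy / compute_busy are the times (in seconds) at which each path
    becomes free, so the function can also be called mid-request.
    Returns the per-chunk decisions and the estimated finish time of the
    slower path.
    """
    decisions = []
    for chunk in chunks:
        stream_done = stream_busy + chunk.size_bytes * 8 / bandwidth_bps
        compute_done = compute_busy + chunk.compute_s
        if stream_done <= compute_done:
            decisions.append("stream")
            stream_busy = stream_done
        else:
            decisions.append("compute")
            compute_busy = compute_done
    return decisions, max(stream_busy, compute_busy)


# Four equal chunks: 1 MB each, 0.5 s to recompute locally.
chunks = [Chunk(1_000_000, 0.5) for _ in range(4)]
decisions, finish_s = plan(chunks, bandwidth_bps=16e6)
```

With a 16 Mbit/s link each chunk takes 0.5 s on either path, so the heuristic alternates between the two and finishes in about half the time of computing everything locally. Calling `plan` again on the remaining chunks with a fresh bandwidth estimate and the current busy times is a crude stand-in for the runtime refinement the abstract mentions.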
