Low-Latency Edge LLM Handover via Joint KV Cache Transfer and Token Prefill
Seunghun Lee, Jihong Park, Ce Zheng, and Hyuncheol Park

TL;DR
This paper introduces a unified handover scheme for edge deployment of large language models that minimizes latency during user mobility by jointly optimizing KV cache transfer and token prefill.
Contribution
It proposes a novel joint selection and scheduling method for KV cache transfer and token prefill to reduce handover delay in edge LLM services.
Findings
Outperforms baseline methods across various backhaul capacities.
Provides a tractable solution with explicit feasibility conditions.
Offers practical guidelines for mobility-aware edge LLM token streaming.
Abstract
Edge deployment of large language models (LLMs) can reduce latency for interactive services, but mobility introduces service interruptions when a user equipment (UE) hands over between base stations (BSs). To promptly resume decoding, the target-side edge server must recover the UE context state, which can be provisioned either by token forwarding followed by prefill computation or by direct key-value (KV) cache transmission over backhaul. This paper proposes a unified handover (HO) design that jointly selects the prefill length and schedules backhaul KV cache delivery to minimize the worst-user LLM HO delay for multiple UEs. The resulting scheme admits a tractable step-wise solution with explicit feasibility conditions and a constructive rate-scheduling policy. Simulations show that the proposed method consistently outperforms baselines across a wide range of backhaul capacities,…
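The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the paper's formulation; it rests on simplifying assumptions made here purely for illustration: prefill time grows linearly with the number of prefilled tokens, prefill and KV cache transfer run in parallel for each UE, and all UEs share one backhaul link whose capacity can be split freely. Every function and parameter name (required_rates, min_worst_delay, prefill_s_per_token, kv_bits_per_token, backhaul_bps) is hypothetical. Under these assumptions, a target delay T is feasible when the backhaul rates needed to ship each UE's non-prefilled KV cache within T sum to at most the capacity, and the smallest feasible T can be found by bisection.

```python
from typing import List, Tuple


def required_rates(
    T: float,
    ctx_lens: List[int],
    prefill_s_per_token: float,
    kv_bits_per_token: float,
) -> Tuple[List[int], List[float]]:
    """For a target delay T > 0, pick per-UE prefill lengths and backhaul rates.

    Toy assumption: each UE prefills as many tokens as fit within T and
    receives the KV cache of its remaining tokens over backhaul within T.
    """
    prefill_lens, rates = [], []
    for n in ctx_lens:
        # Tokens that can be prefilled within T (small epsilon guards float error).
        p = min(n, int(T / prefill_s_per_token + 1e-9))
        prefill_lens.append(p)
        # Rate (bits/s) needed to deliver the other n - p tokens' KV cache in T.
        rates.append((n - p) * kv_bits_per_token / T)
    return prefill_lens, rates


def min_worst_delay(
    ctx_lens: List[int],
    prefill_s_per_token: float,
    kv_bits_per_token: float,
    backhaul_bps: float,
    tol: float = 1e-6,
) -> Tuple[float, List[int], List[float]]:
    """Bisect on the worst-user delay T under the toy model above."""
    lo = 0.0
    # Prefill-only recovery needs no backhaul, so this delay is always feasible.
    hi = max(ctx_lens) * prefill_s_per_token
    if hi == 0.0:
        return 0.0, [0] * len(ctx_lens), [0.0] * len(ctx_lens)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        _, rates = required_rates(mid, ctx_lens, prefill_s_per_token, kv_bits_per_token)
        if sum(rates) <= backhaul_bps:  # feasibility check for delay `mid`
            hi = mid
        else:
            lo = mid
    prefill_lens, rates = required_rates(hi, ctx_lens, prefill_s_per_token, kv_bits_per_token)
    return hi, prefill_lens, rates


if __name__ == "__main__":
    # Illustrative numbers only: two UEs, 1 ms/token prefill, 1 Mbit of KV per token.
    delay, p, r = min_worst_delay([1000, 200], 1e-3, 1e6, 1e9)
    print(delay, p, r)
```

In this toy model, the required rates never increase as T grows, which is what makes bisection valid. With ample backhaul the delay approaches zero (pure KV transfer), and with negligible backhaul it approaches the prefill-only delay of the longest context.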
