POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving
Shaoang Li, Jian Li

TL;DR
POLAR is an online learning approach for edge LLM serving that jointly decides which LoRA adapters stay resident in fast memory and which adapter serves each request, with the aim of reducing the latency incurred by paging non-resident adapters from storage.
Contribution
The paper formulates joint adapter caching and routing as a two-timescale contextual bandit and proposes POLAR (Paging and Online Learning for Adapter Routing), an algorithm that comes with regret guarantees and is evaluated experimentally.
Findings
POLAR outperforms non-adaptive baselines in experiments.
Theoretical regret bounds are established for the proposed algorithms.
Adaptive cache control significantly reduces latency in edge LLM serving.
Abstract
Edge deployment of large language models (LLMs) increasingly relies on libraries of lightweight LoRA adapters, yet GPU/DRAM can keep only a small resident subset at a time. Serving a request through a non-resident adapter requires paging its weights from storage, incurring measurable latency. This creates a two-timescale online control problem: on a slow timescale, the system selects which adapters remain resident in fast memory, while on a fast timescale it routes each request to an adapter whose context-dependent utility is unknown a priori. The two decisions are tightly coupled: the cache determines the cost of exploration, and the router determines which adapters receive informative feedback. We formulate this joint caching-and-routing problem as a two-timescale contextual bandit and propose POLAR (Paging and Online Learning for Adapter Routing). POLAR pairs a cache-aware LinUCB…
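The coupling between the two timescales can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' POLAR algorithm: the abstract is truncated before the method details, so the fixed paging penalty `page_cost`, the epoch length `epoch_len`, the demand-count cache refresh rule, and every function and class name here are hypothetical choices made for illustration. The router is a standard LinUCB whose score is reduced by the paging penalty for non-resident adapters; the cache is refreshed once per epoch.

```python
import math
import random


class LinUCBArm:
    """Ridge-regression utility estimate for one adapter (standard LinUCB).

    A^{-1} is maintained with the Sherman-Morrison rank-one update.
    """

    def __init__(self, dim, reg=1.0):
        self.dim = dim
        self.A_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, vec):
        return [sum(row[j] * vec[j] for j in range(self.dim)) for row in self.A_inv]

    def ucb(self, x, alpha):
        theta = self._A_inv_times(self.b)          # theta = A^{-1} b
        v = self._A_inv_times(x)                   # v = A^{-1} x
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(vi * xi for vi, xi in zip(v, x)), 0.0))
        return mean + alpha * width

    def update(self, x, reward):
        v = self._A_inv_times(x)
        denom = 1.0 + sum(vi * xi for vi, xi in zip(v, x))
        for i in range(self.dim):
            for j in range(self.dim):
                self.A_inv[i][j] -= v[i] * v[j] / denom
            self.b[i] += reward * x[i]


def route(arms, cache, x, alpha, page_cost):
    """Fast timescale: best UCB after charging non-resident adapters (assumed fixed cost)."""
    def score(k):
        return arms[k].ucb(x, alpha) - (0.0 if k in cache else page_cost)
    return max(range(len(arms)), key=score)


def simulate(num_adapters=6, cache_size=2, dim=3, rounds=600,
             epoch_len=50, alpha=0.5, page_cost=0.3, seed=0):
    rng = random.Random(seed)
    # Hypothetical ground-truth utility parameters, unknown to the learner.
    true_theta = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_adapters)]
    arms = [LinUCBArm(dim) for _ in range(num_adapters)]
    cache = set(range(cache_size))
    demand = [0] * num_adapters
    total_reward, page_ins = 0.0, 0
    for t in range(rounds):
        # Slow timescale (assumed rule): keep the adapters that most often had
        # the highest cost-free UCB during the previous epoch.
        if t > 0 and t % epoch_len == 0:
            ranked = sorted(range(num_adapters), key=lambda k: demand[k], reverse=True)
            cache = set(ranked[:cache_size])
            demand = [0] * num_adapters
        x = [rng.uniform(0, 1) for _ in range(dim)]
        demand[max(range(num_adapters), key=lambda k: arms[k].ucb(x, alpha))] += 1
        k = route(arms, cache, x, alpha, page_cost)
        miss = k not in cache                      # served by transient paging
        page_ins += int(miss)
        utility = sum(a * b for a, b in zip(true_theta[k], x)) + rng.gauss(0, 0.1)
        arms[k].update(x, utility)                 # only the routed adapter gets feedback
        total_reward += utility - (page_cost if miss else 0.0)
    return total_reward, page_ins, cache


if __name__ == "__main__":
    reward, misses, resident = simulate()
    print(f"net reward={reward:.2f}, page-ins={misses}, resident={sorted(resident)}")
```

In this sketch a larger `page_cost` makes the router explore non-resident adapters less often, which in turn limits the feedback those adapters receive. That is the coupling the abstract describes, shown here in simplified form.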
