Predictive-LoRA: A Proactive and Fragmentation-Aware Serverless Inference System for LLMs

Yinan Ni; Xiao Yang; Yuqi Tang; Zhimin Qiu; Chen Wang; Tingzhou Yuan

arXiv:2512.20210·cs.DC·December 24, 2025

Open Access

TL;DR

Predictive-LoRA is a serverless inference system for LoRA-based LLMs that prefetches adapters ahead of demand and manages GPU memory in a fragmentation-aware way, reducing cold start latency by up to 68% and achieving 1.52x higher throughput than S-LoRA.

Contribution

It introduces a lightweight LSTM-based traffic predictor and a page-based adapter memory management mechanism to address cold start latency and GPU memory fragmentation in serverless LLM inference.
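
To make the prefetching idea concrete, here is a minimal sketch of demand-driven adapter prefetching. It is not the paper's implementation: the predictor below is a simple exponential moving average standing in for the learned traffic predictor, and the names `EmaDemandPredictor`, `plan_prefetch`, `resident`, and `gpu_slots` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class EmaDemandPredictor:
    """Hypothetical stand-in for a learned traffic predictor.

    Uses an exponential moving average of per-adapter request counts,
    not an LSTM; it only illustrates the forecast-then-prefetch loop.
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha                 # weight of the newest window
        self.score = defaultdict(float)    # adapter_id -> smoothed demand

    def observe(self, window_counts):
        """Fold in request counts from the most recent time window."""
        # Adapters absent from this window decay toward zero.
        for adapter_id in set(self.score) | set(window_counts):
            count = window_counts.get(adapter_id, 0)
            old = self.score[adapter_id]
            self.score[adapter_id] = self.alpha * count + (1 - self.alpha) * old

    def hottest(self, k):
        """Return the k adapters with the highest predicted demand."""
        ranked = sorted(self.score.items(), key=lambda kv: (-kv[1], kv[0]))
        return [adapter_id for adapter_id, _ in ranked[:k]]


def plan_prefetch(predictor, resident, gpu_slots):
    """Decide which adapters to copy to the GPU and which to evict.

    resident  -- set of adapter ids currently on the GPU (assumed input)
    gpu_slots -- number of adapters the GPU can hold (assumed input)
    """
    hot = predictor.hottest(gpu_slots)
    to_load = [a for a in hot if a not in resident]
    to_evict = sorted(a for a in resident if a not in hot)
    return to_load, to_evict
```

A serving loop would call `observe` once per time window and then apply the plan before the next requests arrive, so that a predicted-hot adapter is already on the GPU when it is first requested instead of being loaded reactively.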

Findings

01 Reduces cold start latency by up to 68%.

02 Achieves 1.52x higher throughput than S-LoRA.

03 Reduces Time-To-First-Token by 35% under high concurrency.

Abstract

The serverless computing paradigm offers compelling advantages for deploying Large Language Model (LLM) inference services, including elastic scaling and pay-per-use billing. However, serving multiple fine-tuned LLMs via Low-Rank Adaptation (LoRA) in serverless environments faces critical challenges: reactive adapter loading causes significant cold start latency, and frequent adapter swapping leads to severe GPU memory fragmentation. In this paper, we present Predictive-LoRA (P-LoRA), a proactive and fragmentation-aware serverless inference system for LoRA-based LLMs. P-LoRA introduces two key innovations: (1) a lightweight LSTM-based traffic predictor that forecasts adapter demand and proactively prefetches hot adapters from host memory to GPU, reducing cold start latency by up to 68%; and (2) a page-based adapter memory management mechanism inspired by operating system virtual memory,…
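
The abstract is cut off before it describes the memory manager in detail, so the following is only a sketch of how page-based allocation in general avoids fragmentation, under assumptions of our own. The class `PagedAdapterPool`, its page size, and its abstract size units are hypothetical and do not come from the paper.

```python
import math


class PagedAdapterPool:
    """Toy fixed-size-page allocator for adapter weights.

    Sizes are abstract units rather than bytes on a real GPU. Each adapter
    is mapped to a list of pages that need not be contiguous, in the spirit
    of an operating system page table.
    """

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # adapter_id -> list of page indices

    def load(self, adapter_id, size):
        """Allocate pages for an adapter; return False if it cannot fit."""
        needed = math.ceil(size / self.page_size)
        if adapter_id in self.page_table or needed > len(self.free_pages):
            return False
        pages = [self.free_pages.pop(0) for _ in range(needed)]
        self.page_table[adapter_id] = pages
        return True

    def evict(self, adapter_id):
        """Return an adapter's pages to the free list.

        Raises KeyError if the adapter is not resident.
        """
        self.free_pages.extend(self.page_table.pop(adapter_id))
        self.free_pages.sort()


pool = PagedAdapterPool(num_pages=8, page_size=16)
for name in ["a", "b", "c", "d"]:
    pool.load(name, 32)          # each adapter takes two pages
pool.evict("a")                  # frees pages 0 and 1
pool.evict("c")                  # frees pages 4 and 5
fits = pool.load("e", 64)        # needs four pages; the free ones are scattered
```

After the two evictions the free space is split into two separate holes. An allocator that required one contiguous region of 64 units would reject adapter `e`, whereas the paged pool places it on pages 0, 1, 4, and 5.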

Taxonomy

Topics: Cloud Computing and Resource Management · Software-Defined Networks and 5G · Software System Performance and Reliability