Predictive-LoRA: A Proactive and Fragmentation-Aware Serverless Inference System for LLMs
Yinan Ni, Xiao Yang, Yuqi Tang, Zhimin Qiu, Chen Wang, Tingzhou Yuan

TL;DR
Predictive-LoRA (P-LoRA) is a serverless inference system for LoRA-based LLMs that prefetches adapters ahead of demand and manages GPU memory in a fragmentation-aware way, reducing cold start latency by up to 68% and reaching 1.52x the throughput of S-LoRA in real-world workloads.
Contribution
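It introduces a lightweight LSTM-based traffic predictor, which prefetches hot adapters from host memory to GPU, and a page-based adapter memory management mechanism, which together address cold start latency and GPU memory fragmentation in serverless LLM inference.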
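A rough sketch of the prefetching idea, under loud assumptions: the paper's predictor is an LSTM, whereas this toy swaps in a per-adapter exponential moving average so the example stays dependency-free. The class name `EmaPrefetcher`, its parameters (`gpu_slots`, `alpha`), and the window-based `observe` interface are hypothetical and not taken from the paper.

```python
from collections import defaultdict


class EmaPrefetcher:
    """Toy demand forecaster using an exponential moving average per adapter.

    This is a stand-in for an LSTM predictor, NOT the paper's model.
    All names here are hypothetical.
    """

    def __init__(self, gpu_slots: int, alpha: float = 0.5):
        self.gpu_slots = gpu_slots        # number of adapters that fit on the GPU
        self.alpha = alpha                # EMA smoothing factor
        self.score = defaultdict(float)   # forecast demand per adapter
        self.on_gpu: set[str] = set()     # adapters currently resident

    def observe(self, counts: dict[str, int]) -> None:
        """Fold one time window of request counts into the forecast."""
        # Adapters with no requests in this window decay toward zero.
        for adapter in set(self.score) | set(counts):
            seen = counts.get(adapter, 0)
            self.score[adapter] = (
                self.alpha * seen + (1 - self.alpha) * self.score[adapter]
            )

    def prefetch(self) -> set[str]:
        """Make the top-scoring adapters resident; return the newly loaded ones."""
        # Sort by descending score, breaking ties by name for determinism.
        ranked = sorted(self.score, key=lambda a: (-self.score[a], a))
        hot = set(ranked[: self.gpu_slots])
        loaded = hot - self.on_gpu
        self.on_gpu = hot
        return loaded


prefetcher = EmaPrefetcher(gpu_slots=2)
prefetcher.observe({"a": 10, "b": 1, "c": 5})
first = prefetcher.prefetch()     # "a" and "c" are forecast to be hot
prefetcher.observe({"b": 20})
second = prefetcher.prefetch()    # "b" overtakes "c" and is loaded ahead of demand
```

The point of the sketch is only that loading is driven by a forecast rather than by the arrival of a request, so a hot adapter is already on the GPU when its traffic shows up.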
Findings
Reduces cold start latency by up to 68%.
Achieves 1.52x higher throughput than S-LoRA.
Reduces Time-To-First-Token by 35% under high concurrency.
Abstract
The serverless computing paradigm offers compelling advantages for deploying Large Language Model (LLM) inference services, including elastic scaling and pay-per-use billing. However, serving multiple fine-tuned LLMs via Low-Rank Adaptation (LoRA) in serverless environments faces critical challenges: reactive adapter loading causes significant cold start latency, and frequent adapter swapping leads to severe GPU memory fragmentation. In this paper, we present Predictive-LoRA (P-LoRA), a proactive and fragmentation-aware serverless inference system for LoRA-based LLMs. P-LoRA introduces two key innovations: (1) a lightweight LSTM-based traffic predictor that forecasts adapter demand and proactively prefetches hot adapters from host memory to GPU, reducing cold start latency by up to 68%; and (2) a page-based adapter memory management mechanism inspired by operating system virtual memory,…
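The abstract is cut off before it describes the page-based mechanism in detail, so the following is only a minimal sketch of the general idea borrowed from operating system paging: GPU memory is split into fixed-size pages and an adapter may occupy any free pages, contiguous or not. The class `PagedAdapterPool`, its methods, and the page sizes are hypothetical and do not reflect the paper's actual design.

```python
class PagedAdapterPool:
    """Fixed-size page pool; an adapter may occupy non-contiguous pages.

    Hypothetical illustration of page-based allocation, not the paper's code.
    """

    def __init__(self, num_pages: int, page_bytes: int):
        self.page_bytes = page_bytes
        self.free = list(range(num_pages))        # ids of free pages
        self.table: dict[str, list[int]] = {}     # adapter -> page ids it owns

    def load(self, adapter: str, size_bytes: int) -> bool:
        """Allocate pages for an adapter; return False if it cannot be placed."""
        needed = -(-size_bytes // self.page_bytes)  # ceiling division
        if adapter in self.table or needed > len(self.free):
            return False
        self.table[adapter] = [self.free.pop() for _ in range(needed)]
        return True

    def evict(self, adapter: str) -> None:
        """Return an adapter's pages to the free list."""
        self.free.extend(self.table.pop(adapter))


pool = PagedAdapterPool(num_pages=4, page_bytes=100)
for name in ["a", "b", "c", "d"]:
    pool.load(name, 100)          # each adapter takes exactly one page

# Evicting "a" and "c" frees two pages that are not adjacent.
pool.evict("a")
pool.evict("c")

# A two-page adapter still fits, because pages need not be contiguous.
fits = pool.load("e", 200)
pages_for_e = sorted(pool.table["e"])
```

A contiguous allocator would reject `"e"` here even though enough total memory is free, which is the fragmentation problem that frequent adapter swapping creates.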
Taxonomy
Topics: Cloud Computing and Resource Management · Software-Defined Networks and 5G · Software System Performance and Reliability
