ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

Yifan Sui; Hao Wang; Hanfei Yu; Yitao Hu; Jianxun Li; Hao Wang

arXiv:2505.14468·cs.LG·May 21, 2025

Open Access

TL;DR

ServerlessLoRA is a system that significantly reduces latency and costs in serverless inference of LoRA-adapted LLMs by sharing resources, pre-loading artifacts, and managing GPU contention.

Contribution

It introduces a novel serverless inference system tailored for LoRA-based LLMs, addressing redundancy, latency, and resource contention issues.

Findings

01. Reduces Time-To-First-Token by up to 86%

02. Cuts monetary costs by up to 89%

03. Improves efficiency of LoRA inference in serverless environments

Abstract

Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless systems can effectively serve general LLMs but fail at Low-Rank Adaptation (LoRA) inference due to three key limitations: 1) massive parameter redundancy among functions, where 99% of weights are unnecessarily duplicated, 2) costly artifact loading latency beyond LLM loading, and 3) magnified resource contention when serving multiple LoRA LLMs. These inefficiencies lead to massive GPU wastage, increased Time-To-First-Token (TTFT), and high monetary costs. We propose ServerlessLoRA, a novel serverless inference system designed for faster and cheaper LoRA LLM serving. ServerlessLoRA enables secure backbone LLM sharing across isolated LoRA functions to reduce redundancy. We design…
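The redundancy claim in the abstract can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the function names, the hidden size of 4096, the rank of 16, and the memory figures are hypothetical values chosen only for illustration. It assumes a LoRA adapter adds two low-rank factors to a square weight matrix, and compares the memory needed when every function loads its own backbone copy against the memory needed when one backbone is shared.

```python
def adapter_fraction(d_model: int, rank: int) -> float:
    """Share of parameters a LoRA adapter contributes for one
    d_model x d_model weight matrix (hypothetical, simplified model)."""
    backbone = d_model * d_model      # frozen backbone weight
    adapter = 2 * d_model * rank      # two low-rank factors of shape (d_model, rank)
    return adapter / (backbone + adapter)


def memory_without_sharing(n_functions: int, backbone_gb: float, adapter_gb: float) -> float:
    """Each function holds its own backbone copy plus its adapter."""
    return n_functions * (backbone_gb + adapter_gb)


def memory_with_sharing(n_functions: int, backbone_gb: float, adapter_gb: float) -> float:
    """One shared backbone copy plus one adapter per function."""
    return backbone_gb + n_functions * adapter_gb


if __name__ == "__main__":
    # Assumed illustrative values, not measurements from the paper.
    frac = adapter_fraction(d_model=4096, rank=16)
    print(f"adapter share of parameters: {frac:.4%}")
    print("no sharing (GB):", memory_without_sharing(4, 10, 1))
    print("sharing    (GB):", memory_with_sharing(4, 10, 1))
```

Under these assumed sizes the adapter accounts for well under 1% of the parameters, which is consistent in spirit with the abstract's statement that the vast majority of weights are duplicated when each function loads the full model.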

Taxonomy

Topics: IoT and Edge/Fog Computing · Cloud Data Security Solutions · Privacy-Preserving Technologies in Data

Methods: Attentive Walk-Aggregating Graph Neural Network