Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems

Shashwat Jaiswal; Shrikara Arun; Anjaly Parayil; Ankur Mallick; Spyros Mastorakis; Alind Khare; Chloi Alverti; Renee St Amant; Chetan Bansal; Victor Rühle; Josep Torrellas

arXiv:2511.22880·cs.DC·December 1, 2025

Open Access

TL;DR

LoRAServe is a dynamic, workload-aware system for managing heterogeneous LoRA adapters in distributed LLM inference. It reports up to 2× higher throughput and up to 9× lower tail time-to-first-token (TTFT) latency while using up to 50% fewer GPUs under SLO constraints.

Contribution

The paper introduces LoRAServe, a framework for dynamic adapter placement and routing that addresses rank heterogeneity in LoRA serving systems.

Findings

01 Up to 2× higher throughput.

02 Up to 9× lower tail time-to-first-token (TTFT) latency.

03 Up to 50% fewer GPUs used under SLO constraints.

Abstract

Low-Rank Adaptation (LoRA) has become the de facto method for parameter-efficient fine-tuning of large language models (LLMs), enabling rapid adaptation to diverse domains. In production, LoRA-based models are served at scale, creating multi-tenant environments with hundreds of adapters sharing a base model. However, state-of-the-art serving systems co-batch heterogeneous adapters without accounting for rank (size) variability, leading to severe performance skew, which ultimately requires adding more GPUs to satisfy service-level objectives (SLOs). Existing optimizations, focused on loading, caching, and kernel execution, ignore this heterogeneity, leaving GPU resources underutilized. We present LoRAServe, a workload-aware dynamic adapter placement and routing framework designed to tame rank diversity in LoRA serving. By dynamically rebalancing adapters across GPUs and leveraging GPU…
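To make the placement problem concrete, the sketch below balances adapters across GPUs with a greedy longest-processing-time heuristic. It is an illustration only, not LoRAServe's algorithm: the function name `place_adapters`, the input format, and the cost model (load proportional to rank × request rate) are all assumptions introduced here.

```python
import heapq
from typing import Dict, List, Tuple


def place_adapters(
    adapters: Dict[str, Tuple[int, float]], num_gpus: int
) -> List[List[str]]:
    """Assign adapters to GPUs so estimated load is roughly balanced.

    adapters maps a (hypothetical) adapter name to (rank, requests_per_sec).
    Assumed cost model: load = rank * requests_per_sec.
    """
    # Heaviest adapters first, so large ranks are spread out early.
    loads = sorted(
        ((rank * rate, name) for name, (rank, rate) in adapters.items()),
        reverse=True,
    )

    # Min-heap of (accumulated_load, gpu_index).
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)

    placement: List[List[str]] = [[] for _ in range(num_gpus)]
    for load, name in loads:
        total, gpu = heapq.heappop(heap)  # least-loaded GPU so far
        placement[gpu].append(name)
        heapq.heappush(heap, (total + load, gpu))
    return placement


if __name__ == "__main__":
    demo = {"a": (64, 1.0), "b": (8, 1.0), "c": (8, 1.0), "d": (16, 1.0)}
    print(place_adapters(demo, num_gpus=2))
```

In this toy input the rank-64 adapter gets a GPU to itself while the three smaller adapters share the other, which is the kind of rank-aware separation that rank-oblivious co-batching does not attempt. A real system would also need to react to changing request rates, which this static sketch ignores.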

Taxonomy

Topics: Big Data and Digital Economy · IoT and Edge/Fog Computing · Parallel Computing and Optimization Techniques