InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

Hongyu Chen; Letian Ruan; Zilin Xu; Yuchen Li; Xinyu Chen; Jingwen Leng; Bingsheng He; Minyi Guo; Shixuan Sun

arXiv:2604.07173 · cs.DC · April 9, 2026

TL;DR

InfiniLoRA is a disaggregated LoRA serving system for large language models. It decouples LoRA execution from base-model inference, which improves scalability and tail latency and sustains higher request rates under latency SLOs.

Contribution

The paper introduces InfiniLoRA, a disaggregated LoRA serving system built around a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels.

Findings

01

Achieves an average 3.05x increase in serviceable request rate under strict latency SLOs.

02

Improves the percentage of LoRA adapters satisfying the SLO requirement by 54.0%.

03

Scales better than existing coupled LoRA serving designs, which the authors describe as poorly scalable and prone to tail-latency inflation.

Abstract

LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as MoE significantly increase LoRA memory cost, making existing coupled LoRA serving designs poorly scalable and prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA can achieve an average $3.05\times$ increase in serviceable request rate under strict latency SLOs, and improve the percentage of LoRA adapters satisfying the SLO requirement by 54.0%.
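
To illustrate why LoRA execution can be separated from base-model inference at all, the sketch below uses the standard LoRA formulation, in which a layer's output is the base projection plus a scaled low-rank update, y = Wx + s·B(Ax). Because the two terms are additive, the low-rank term can in principle be computed by a different worker and summed afterwards. This is a minimal pure-Python sketch and not the InfiniLoRA implementation: the function names, the `adapters` dictionary standing in for a shared adapter store, and the `scale` parameter are all hypothetical, and the communication, provisioning, and kernel optimizations the abstract mentions are not modeled.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def base_forward(W, x):
    """Base-model projection W @ x. Needs no adapter weights."""
    return matvec(W, x)


def lora_delta(A, B, x, scale):
    """Low-rank update scale * B @ (A @ x). Needs no base weights."""
    return [scale * v for v in matvec(B, matvec(A, x))]


def coupled_forward(W, A, B, x, scale):
    """Reference path: apply the merged weight (W + scale * B @ A) to x."""
    rank = len(A)
    merged = [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(len(W[0]))
        ]
        for i in range(len(W))
    ]
    return matvec(merged, x)


def disaggregated_forward(W, adapters, adapter_id, x, scale):
    """Compute the base and LoRA terms independently, then add them.

    `adapters` is a hypothetical stand-in for adapter weights held
    outside the base-model worker.
    """
    A, B = adapters[adapter_id]
    base = base_forward(W, x)            # would run on the base-model side
    delta = lora_delta(A, B, x, scale)   # would run on the LoRA side
    return [b + d for b, d in zip(base, delta)]


if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0]]                      # 2x2 base weight
    adapters = {"tenant-a": ([[1.0, 1.0]], [[1.0], [2.0]])}  # rank-1 (A, B)
    x = [1.0, 1.0]
    print(disaggregated_forward(W, adapters, "tenant-a", x, scale=0.5))
    print(coupled_forward(W, *adapters["tenant-a"], x, scale=0.5))
```

Both paths produce the same output for the same inputs; the difference is only where the adapter weights must reside, which is the property a disaggregated design relies on.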
