InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
Hongyu Chen, Letian Ruan, Zilin Xu, Yuchen Li, Xinyu Chen, Jingwen Leng, Bingsheng He, Minyi Guo, Shixuan Sun

TL;DR
InfiniLoRA is a disaggregated multi-LoRA serving system for large language models. It decouples LoRA execution from base-model inference, which improves scalability, reduces tail latency, and sustains higher request rates under latency SLOs.
Contribution
The paper introduces InfiniLoRA, a disaggregated LoRA serving system built around a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations (GPU-initiated communication and hardware-specialized LoRA kernels).
Findings
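Achieves an average 3.05x increase in serviceable request rate under strict latency SLOs.
Increases the percentage of LoRA adapters satisfying the SLO requirement by 54.0%.
Scales better than existing coupled LoRA serving designs, which struggle with the higher LoRA memory cost of architectures such as MoE.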
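To make the adapter-level SLO metric concrete, here is a minimal sketch of how such a percentage could be computed. The summary does not give the paper's exact definition, so the rule used here (an adapter counts as satisfying the SLO when at least `target_fraction` of its requests finish within `slo_ms`) is an assumption, as are the function name, parameters, and sample data.

```python
def adapter_slo_attainment(latencies_ms_by_adapter, slo_ms, target_fraction=0.9):
    """Fraction of adapters whose requests meet the latency SLO often enough.

    Assumption (not from the paper): an adapter "satisfies the SLO" when at
    least target_fraction of its requests finish within slo_ms.
    """
    satisfied = 0
    counted = 0
    for latencies in latencies_ms_by_adapter.values():
        if not latencies:
            continue  # adapters with no traffic are skipped
        counted += 1
        within = sum(1 for t in latencies if t <= slo_ms)
        if within / len(latencies) >= target_fraction:
            satisfied += 1
    return satisfied / counted if counted else 0.0


# Made-up latencies (milliseconds) for three hypothetical adapters.
sample = {
    "adapter-a": [10, 20, 30],   # all within 50 ms
    "adapter-b": [10, 200, 30],  # one slow request, only 2 of 3 within 50 ms
    "adapter-c": [5, 5],         # all within 50 ms
}
attainment = adapter_slo_attainment(sample, slo_ms=50)
print(f"{attainment:.1%} of adapters satisfy the SLO")
```

Counting per adapter rather than per request means a single starved tenant lowers the metric even when aggregate latency looks healthy.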
Abstract
LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as MoE significantly increase LoRA memory cost, making existing coupled LoRA serving designs poorly scalable and prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA can achieve an average 3.05x increase in serviceable request rate under strict latency SLOs, and improve the percentage of LoRA adapters satisfying the SLO requirement by 54.0%.
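The decoupling described above rests on the standard LoRA identity (W + s·BA)x = Wx + s·B(Ax): the low-rank term can be computed apart from the frozen base weight and added afterwards. The sketch below illustrates only that identity, not InfiniLoRA's implementation. The `LoRAServer` class, function names, and toy matrices are hypothetical, and plain Python lists stand in for GPU tensors and for whatever transport a real system would use.

```python
# Hypothetical names throughout; pure-Python lists stand in for GPU tensors.

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRAServer:
    """Stand-in for a shared LoRA server: holds adapters, returns only the low-rank delta."""

    def __init__(self):
        self.adapters = {}  # adapter_id -> (A, B, scale)

    def register(self, adapter_id, A, B, scale=1):
        self.adapters[adapter_id] = (A, B, scale)

    def delta(self, adapter_id, x):
        A, B, scale = self.adapters[adapter_id]
        # Two skinny products: A is (r x d_in), B is (d_out x r).
        return [scale * v for v in matvec(B, matvec(A, x))]


def decoupled_forward(W, server, adapter_id, x):
    """Base side computes W x with no adapter state; the server supplies the delta."""
    base_out = matvec(W, x)
    lora_out = server.delta(adapter_id, x)  # a real system would cross an interconnect here
    return [b + d for b, d in zip(base_out, lora_out)]


def merged_forward(W, A, B, x, scale=1):
    """Coupled reference: fold the adapter into the weight, (W + scale * B A) x."""
    r = len(A)
    merged = [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]
    return matvec(merged, x)


W = [[1, 2], [3, 4]]   # frozen base weight (2 x 2)
A = [[1, 1]]           # rank-1 down-projection (1 x 2)
B = [[1], [2]]         # rank-1 up-projection (2 x 1)
x = [1, 2]

server = LoRAServer()
server.register("tenant-a", A, B)

# Both paths agree, so adapter weights need not live with the base model.
assert decoupled_forward(W, server, "tenant-a", x) == merged_forward(W, A, B, x)
print(decoupled_forward(W, server, "tenant-a", x))  # [8, 17]
```

Because the base side never touches `A` or `B`, adapter memory can be provisioned separately from base-model replicas. The price is an extra exchange of activations per LoRA-augmented layer, which is presumably why the abstract emphasizes critical-path optimizations.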
