Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management

Hang Zhang; Jiuchen Shi; Yixiao Wang; Quan Chen; Yizhou Shan; Minyi Guo

arXiv:2505.03756·cs.AR·May 8, 2025

Open Access

TL;DR

This paper introduces FASTLIBRA, a caching system for Multi-LoRA LLM serving that tracks the usage dependencies between LoRA adapters and KV caches and swaps them in and out of accelerator memory according to a unified cost model, significantly reducing time-to-first-token (TTFT).

Contribution

FASTLIBRA is the first system to optimize Multi-LoRA serving through dependency-aware caching and performance-driven cache swapping, improving inference speed.

Findings

01

Reduces TTFT by 63.4% on average

02

Efficiently manages LoRA and KV cache dependencies

03

Improves inference performance in Multi-LoRA LLMs

Abstract

Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For Multi-LoRA serving, caching hot KV caches and LoRA adapters in the high-bandwidth memory (HBM) of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance metrics such as Time-To-First-Token (TTFT) because they neglect usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a Multi-LoRA caching system that optimizes serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during inference with a unified caching pool. The cache swapper determines the swap-in or swap-out of LoRAs and KV caches based on a unified cost model, when the HBM is idle or busy,…
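To make the idea of a dependency-aware unified pool concrete, here is a minimal Python sketch. It is not FASTLIBRA's implementation: the names `Entry` and `UnifiedPool` are hypothetical, and because the abstract does not spell out the cost model, plain least-recently-used ordering stands in for it as an assumption. The only property the sketch illustrates is that a LoRA adapter is never evicted while KV caches that depend on it are still resident.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    name: str
    size: int                      # footprint in arbitrary memory units
    parent: Optional[str] = None   # LoRA this KV cache depends on; None for a LoRA
    last_used: int = 0             # logical timestamp of the last access


class UnifiedPool:
    """Toy pool holding LoRAs and KV caches together (hypothetical API)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}
        self.clock = 0

    def used(self) -> int:
        return sum(e.size for e in self.entries.values())

    def _has_dependants(self, name: str) -> bool:
        return any(e.parent == name for e in self.entries.values())

    def _evict_one(self, protect) -> str:
        # Only entries with no resident dependants may leave the pool.
        candidates = [
            e for e in self.entries.values()
            if e.name not in protect and not self._has_dependants(e.name)
        ]
        if not candidates:
            raise MemoryError("nothing can be evicted safely")
        # Assumption: LRU stands in for the paper's unified cost model.
        victim = min(candidates, key=lambda e: e.last_used)
        del self.entries[victim.name]
        return victim.name

    def insert(self, name: str, size: int, parent: Optional[str] = None):
        if parent is not None and parent not in self.entries:
            raise KeyError(f"LoRA {parent!r} must be resident before its KV cache")
        self.clock += 1
        evicted = []
        while self.used() + size > self.capacity:
            evicted.append(self._evict_one(protect={name, parent}))
        self.entries[name] = Entry(name, size, parent, self.clock)
        return evicted


pool = UnifiedPool(capacity=10)
pool.insert("lora_a", 4)
pool.insert("kv_a1", 3, parent="lora_a")
pool.insert("lora_b", 3)
# The pool is now full. lora_a is the oldest entry, but kv_a1 depends on it,
# so the KV cache is evicted first.
evicted = pool.insert("kv_b1", 3, parent="lora_b")
print(evicted)
```

A dependency-blind LRU policy would have dropped `lora_a` here and left `kv_a1` resident but unusable, which is the kind of wasted accelerator memory the abstract attributes to neglecting usage dependencies.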

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Parallel Computing and Optimization Techniques