Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Rickard Brüel-Gabrielsson; Jiacheng Zhu; Onkar Bhardwaj; Leshem Choshen; Kristjan Greenewald; Mikhail Yurochkin; Justin Solomon

arXiv:2407.00066 · cs.DC · June 2, 2025

TL;DR

This paper introduces a joint compression method for serving thousands of LoRA adapters on a single large language model. With 1000 LoRAs, it retains over 80% of the throughput of serving a single LoRA while preserving model performance.

Contribution

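The authors propose jointly compressing LoRAs into a shared basis paired with LoRA-specific scaling matrices. They extend the method to learn clusters of LoRAs that compress well together, so that it scales to large LoRA collections with low serving overhead.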

Findings

01 Compressed LoRAs preserve model performance.

02 Serving 1000 compressed LoRAs achieves over 80% of the throughput of serving a single LoRA.

03 Clustering-based compression scales to large LoRA collections.

Abstract

Fine-tuning large language models (LLMs) with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRAs. We propose a method for the joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. We extend our algorithm to learn clusters of LoRAs that are amenable to joint compression, allowing it to scale gracefully to large LoRA collections. Our experiments with up to 1000 LoRAs demonstrate that…
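The shared-basis idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' algorithm: it assumes each LoRA update is available as a dense matrix, and it picks the shared bases with a simple SVD heuristic over the concatenated updates. The names `joint_compress`, `reconstruct`, and `make_demo_loras` are hypothetical and exist only for this illustration.

```python
import numpy as np

def joint_compress(deltas, rank):
    """Compress a list of (d_out, d_in) updates into shared bases U, V
    plus one small (rank, rank) matrix per update.

    Assumption: an SVD heuristic stands in for whatever optimization
    the paper actually uses.
    """
    # Shared left basis: top left singular vectors of the updates side by side.
    U, _, _ = np.linalg.svd(np.hstack(deltas), full_matrices=False)
    U = U[:, :rank]
    # Shared right basis: top right singular vectors of the stacked updates.
    _, _, Vt = np.linalg.svd(np.vstack(deltas), full_matrices=False)
    V = Vt[:rank].T
    # Per-LoRA matrix: projection of each update onto the shared bases.
    sigmas = [U.T @ d @ V for d in deltas]
    return U, V, sigmas

def reconstruct(U, V, sigma):
    """Approximate one update from the shared bases and its own small matrix."""
    return U @ sigma @ V.T

def make_demo_loras(n=8, d_out=32, d_in=24, rank=4, seed=0):
    """Synthetic updates that share an exact rank-`rank` basis (demo data only)."""
    rng = np.random.default_rng(seed)
    U0, _ = np.linalg.qr(rng.standard_normal((d_out, rank)))
    V0, _ = np.linalg.qr(rng.standard_normal((d_in, rank)))
    return [U0 @ rng.standard_normal((rank, rank)) @ V0.T for _ in range(n)]

if __name__ == "__main__":
    deltas = make_demo_loras()
    U, V, sigmas = joint_compress(deltas, rank=4)
    errors = [np.linalg.norm(d - reconstruct(U, V, s)) / np.linalg.norm(d)
              for d, s in zip(deltas, sigmas)]
    print("max relative reconstruction error:", max(errors))
```

In this sketch only `U` and `V` are stored once, and each LoRA contributes a single `rank × rank` matrix, which is why adding more LoRAs costs little extra memory. Real LoRA collections will not share a basis exactly, so the reconstruction error would be nonzero and would depend on the chosen rank.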

Videos

Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead · SlidesLive
