Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
Rickard Brüel-Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, Justin Solomon

TL;DR
This paper introduces a compression method for efficiently serving thousands of LoRA adapters on large language models, significantly reducing overhead while maintaining high performance.
Contribution
The authors propose jointly compressing LoRAs into a shared basis paired with LoRA-specific scaling matrices, and extend the method with clustering so that it scales to large LoRA collections with low serving overhead.
Findings
Compressed LoRAs preserve model performance.
Serving 1000 compressed LoRAs achieves over 80% of the throughput of serving a single LoRA.
Clustering-based compression scales to large LoRA collections.
Abstract
Fine-tuning large language models (LLMs) with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRAs. We propose a method for the joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. We extend our algorithm to learn clusters of LoRAs that are amenable to joint compression, allowing it to scale gracefully to large LoRA collections. Our experiments with up to 1000 LoRAs demonstrate that…
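The idea of a shared basis with LoRA-specific scaling matrices can be illustrated with a minimal NumPy sketch. This is not the authors' algorithm: it assumes a simple SVD-based heuristic in which the shared bases `U` and `V` are taken from the stacked LoRA updates, and each LoRA `i` keeps only a small matrix `sigma_i` such that `B_i @ A_i ≈ U @ sigma_i @ V.T`. The function names `joint_compress` and `reconstruct`, the dimensions, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def joint_compress(updates, rank):
    """Compress a list of (d_out, d_in) LoRA updates into shared bases.

    Illustrative heuristic only (an assumption, not the paper's method):
    the shared bases come from SVDs of the stacked updates.
    """
    # Shared left basis: left singular vectors of the updates placed side by side.
    U, _, _ = np.linalg.svd(np.hstack(updates), full_matrices=False)
    # Shared right basis: right singular vectors of the updates stacked vertically.
    _, _, Vt = np.linalg.svd(np.vstack(updates), full_matrices=False)
    U = U[:, :rank]
    V = Vt[:rank].T
    # With orthonormal U and V fixed, U.T @ W @ V is the least-squares
    # optimal (rank x rank) scaling matrix for each update W.
    sigmas = [U.T @ W @ V for W in updates]
    return U, V, sigmas


def reconstruct(U, V, sigma):
    """Rebuild an approximate LoRA update from the shared bases."""
    return U @ sigma @ V.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_out, d_in, lora_rank, n_loras = 64, 48, 4, 10
    # Synthetic LoRA updates B_i @ A_i (random, purely for demonstration).
    updates = [
        rng.standard_normal((d_out, lora_rank)) @ rng.standard_normal((lora_rank, d_in))
        for _ in range(n_loras)
    ]
    U, V, sigmas = joint_compress(updates, rank=16)
    errors = [
        np.linalg.norm(W - reconstruct(U, V, s)) / np.linalg.norm(W)
        for W, s in zip(updates, sigmas)
    ]
    print("mean relative reconstruction error:", float(np.mean(errors)))
```

In this sketch only the small `sigma_i` matrices differ per LoRA, while `U` and `V` are stored once, which is the property that would let many adapters stay resident in GPU memory. Unrelated random LoRAs compress poorly under a small shared rank, which gives some intuition for why grouping LoRAs into clusters that compress well together can help.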
Code & Models
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task280 (model)
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task190 (model)
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task391 (model)
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task290 (model)
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task1391 (model)
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task1342 (model)
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task442 (model)
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task620 (model)
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task1598 (model)
- 🤗 Lots-of-LoRAs/Mistral-7B-Instruct-v0.2-4b-r16-task039 (model)
- Lots-of-LoRAs/task816_pawsx_japanese_spanish_translation (dataset, 8 downloads)
- Lots-of-LoRAs/task809_pawsx_chinese_french_translation (dataset, 22 downloads)
- Lots-of-LoRAs/task045_miscellaneous_sentence_paraphrasing (dataset, 81 downloads)
- Lots-of-LoRAs/task588_amazonfood_rating_classification (dataset, 51 downloads)
- Lots-of-LoRAs/task461_qasper_question_generation (dataset, 54 downloads)
Taxonomy
Topics: Underwater Vehicles and Communication Systems · IoT and Edge/Fog Computing · IoT Networks and Protocols
