TL;DR
This paper introduces block-diagonal LoRA, a sharding strategy for tensor-parallel serving of multiple LoRA adapters. It requires no additional communication for the LoRA computations, which speeds up inference (up to 1.79x end-to-end on 8 GPUs) without sacrificing parameter efficiency.
Contribution
The paper constrains certain LoRA factors to be block-diagonal. This permits an alternative adapter sharding for tensor-parallel serving that avoids the communication overhead of the S-LoRA sharding strategy, improving speed while maintaining parameter efficiency.
Findings
Achieves up to 1.79x end-to-end speed-up on 8 GPUs.
Maintains downstream performance similar to that of standard LoRA.
Removes the additional communication that the LoRA computations otherwise require in multi-device serving.
Abstract
When serving a single base LLM with several different LoRA adapters simultaneously, the adapters cannot simply be merged with the base model's weights as the adapter swapping would create overhead and requests using different adapters could not be batched. Rather, the LoRA computations have to be separated from the base LLM computations, and in a multi-device setup the LoRA adapters can be sharded in a way that is well aligned with the base model's tensor parallel execution, as proposed in S-LoRA. However, the S-LoRA sharding strategy encounters some communication overhead, which may be small in theory, but can be large in practice. In this paper, we propose to constrain certain LoRA factors to be block-diagonal, which allows for an alternative way of sharding LoRA adapters that does not require any additional communication for the LoRA computations. We demonstrate in extensive…
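To make the idea in the abstract concrete, here is a minimal NumPy sketch of one way a block-diagonal factor can remove the extra communication. It is an illustration under stated assumptions, not the paper's implementation: it assumes a row-parallel linear layer whose input features are already split across devices, a block-diagonal down-projection A, and a row-sharded up-projection B. The excerpt above does not say which factors the paper constrains, and the names simulate, block_diag, n_dev, and rank are hypothetical. Devices are simulated by list entries, and the final sum stands in for the single all-reduce that a row-parallel base layer already performs.

```python
import numpy as np


def block_diag(blocks):
    """Assemble a block-diagonal matrix from equally shaped blocks."""
    rows, cols = blocks[0].shape
    out = np.zeros((rows * len(blocks), cols * len(blocks)))
    for i, blk in enumerate(blocks):
        out[i * rows:(i + 1) * rows, i * cols:(i + 1) * cols] = blk
    return out


def simulate(n_dev=4, d_in=16, d_out=12, rank=8, batch=3, seed=0):
    """Compare an unsharded LoRA layer with a simulated sharded one.

    Assumed setup (not taken from the paper): row-parallel layer,
    block-diagonal A, row-sharded B.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, d_in))
    W = rng.standard_normal((d_in, d_out))  # base weight
    # One (d_in / n_dev) x (rank / n_dev) block of A per device.
    A_blocks = [
        rng.standard_normal((d_in // n_dev, rank // n_dev))
        for _ in range(n_dev)
    ]
    B = rng.standard_normal((rank, d_out))  # LoRA up-projection

    # Unsharded reference: y = x W + (x A) B with A block-diagonal.
    reference = x @ W + (x @ block_diag(A_blocks)) @ B

    # Each simulated device holds a feature slice of x and the
    # matching row slices of W and B, plus its own block of A.
    x_shards = np.split(x, n_dev, axis=1)
    W_shards = np.split(W, n_dev, axis=0)
    B_shards = np.split(B, n_dev, axis=0)

    # Local work only: the LoRA term is added to the base partial
    # output before any cross-device step.
    partials = [
        xi @ Wi + (xi @ Ai) @ Bi
        for xi, Wi, Ai, Bi in zip(x_shards, W_shards, A_blocks, B_shards)
    ]

    # Stand-in for the one all-reduce the base layer needs anyway.
    sharded = sum(partials)
    return reference, sharded


if __name__ == "__main__":
    ref, out = simulate()
    print("max abs difference:", np.abs(ref - out).max())
```

Because A is block-diagonal, each device's slice of the rank-dimensional intermediate depends only on its own input slice, so the LoRA contribution folds into the partial sums that the base layer already reduces. With a dense A, the intermediate x A would itself be a sum over devices and would need its own collective step.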