LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design

Rui Kong; Qiyang Li; Xinyu Fang; Qingtian Feng; Qingfeng He; Yazhu Dong; Weijun Wang; Yuanchun Li; Linghe Kong; Yunxin Liu

arXiv:2405.17741·cs.AI·May 29, 2024

TL;DR

LoRA-Switch introduces a token-wise routing system with optimized CUDA kernel fusion, significantly reducing inference latency of dynamic LLM adapters while maintaining accuracy improvements.

Contribution

The paper presents a token-wise routing mechanism and CUDA kernel fusion for dynamic LLM adapters, targeting the inference latency overhead caused by fragmented CUDA kernel calls.

Findings

1. Reduces decoding latency by over 2.4 times
2. Maintains accuracy improvements of existing dynamic adapters
3. Demonstrates effectiveness on popular open-source LLMs

Abstract

Recent literature has found that an effective method to customize or further improve large language models (LLMs) is to add dynamic adapters, such as low-rank adapters (LoRA) with Mixture-of-Experts (MoE) structures. Though such dynamic adapters incur modest computational complexity, they surprisingly lead to huge inference latency overhead, slowing down the decoding speed by 2.5+ times. In this paper, we analyze the fine-grained costs of the dynamic adapters and find that the fragmented CUDA kernel calls are the root cause. Therefore, we propose LoRA-Switch, a system-algorithm co-designed architecture for efficient dynamic adapters. Unlike most existing dynamic structures that adopt layer-wise or block-wise dynamic routing, LoRA-Switch introduces a token-wise routing mechanism. It switches the LoRA adapters and weights for each token and merges them into the backbone for inference. For…
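The merging idea mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation and does not model CUDA kernels; it only shows, in plain Python, that folding gate-weighted LoRA updates into the backbone weight gives the same output as applying each adapter separately, while replacing several small matrix products per layer with one. The function names, shapes, and gate values below are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: names, shapes, and gate values are hypothetical
# and are not taken from the LoRA-Switch implementation.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def unmerged_forward(W, adapters, gates, x):
    """Backbone output plus two extra small products per active adapter."""
    y = matvec(W, x)
    for (A, B), g in zip(adapters, gates):
        if g == 0.0:
            continue  # adapter not selected by the router
        delta = matvec(B, matvec(A, x))
        y = [yi + g * di for yi, di in zip(y, delta)]
    return y

def merge_adapters(W, adapters, gates):
    """Fold the selected adapters into a copy of W: W + sum_i g_i * B_i A_i."""
    merged = [row[:] for row in W]
    for (A, B), g in zip(adapters, gates):
        if g == 0.0:
            continue
        BA = matmul(B, A)
        merged = [[m + g * d for m, d in zip(mrow, drow)]
                  for mrow, drow in zip(merged, BA)]
    return merged

# Toy backbone weight (2x2) and two rank-1 adapters (A is 1x2, B is 2x1).
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 2.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [2.0]]),
]
gates = [0.5, 0.25]   # hypothetical routing weights for one token
x = [2.0, 4.0]        # hidden state of one token

y_unmerged = unmerged_forward(W, adapters, gates, x)
y_merged = matvec(merge_adapters(W, adapters, gates), x)
print(y_unmerged, y_merged)  # [7.0, 6.0] [7.0, 6.0]
```

In this toy setting the merged path performs a single matrix-vector product per layer once the weights are merged, which is the intuition behind avoiding many small, separate operations during decoding.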

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings