LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design
Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu, Dong, Weijun Wang, Yuanchun Li, Linghe Kong, Yunxin Liu

TL;DR
LoRA-Switch introduces a token-wise routing system with optimized CUDA kernel fusion, significantly reducing inference latency of dynamic LLM adapters while maintaining accuracy improvements.
Contribution
It presents a novel token-wise routing mechanism and CUDA kernel fusion for efficient dynamic adapters in LLMs, addressing latency issues.
Findings
Reduces decoding latency by over 2.4 times
Maintains accuracy improvements of existing dynamic adapters
Demonstrates effectiveness on popular open-source LLMs
Abstract
Recent literature has found that an effective method to customize or further improve large language models (LLMs) is to add dynamic adapters, such as low-rank adapters (LoRA) with Mixture-of-Experts (MoE) structures. Though such dynamic adapters incur modest computational complexity, they surprisingly lead to huge inference latency overhead, slowing down the decoding speed by 2.5+ times. In this paper, we analyze the fine-grained costs of the dynamic adapters and find that the fragmented CUDA kernel calls are the root cause. Therefore, we propose LoRA-Switch, a system-algorithm co-designed architecture for efficient dynamic adapters. Unlike most existing dynamic structures that adopt layer-wise or block-wise dynamic routing, LoRA-Switch introduces a token-wise routing mechanism. It switches the LoRA adapters and weights for each token and merges them into the backbone for inference. For…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDistributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
