MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models
Jingwei Xu, Junyu Lai, Yunpeng Huang

TL;DR
MeteoRA introduces a scalable Mixture-of-Experts framework to embed multiple task-specific LoRA adapters into large language models, enabling efficient multi-task handling and autonomous task switching during inference.
Contribution
The paper proposes MeteoRA, a novel framework that embeds multiple LoRA adapters into LLMs using MoE architecture, improving multi-task performance and adapter switching efficiency.
Findings
Achieves equivalent performance to traditional PEFT methods.
Enables LLMs to handle multiple tasks in a single inference pass.
Demonstrates superior performance in composite task scenarios.
Abstract
The pretrain+fine-tune paradigm is foundational for deploying large language models (LLMs) across various downstream applications. Within this framework, Low-Rank Adaptation (LoRA) stands out for its parameter-efficient fine-tuning (PEFT), producing numerous reusable task-specific LoRA adapters. However, this approach requires explicit task intention selection, posing challenges for autonomous task sensing and switching during inference with multiple existing LoRA adapters embedded in a single LLM. In this work, we introduce MeteoRA (Multiple-tasks embedded LoRA), a scalable and efficient framework that reuses multiple task-specific LoRA adapters into the base LLM via a full-mode Mixture-of-Experts (MoE) architecture. This framework also includes novel MoE forward acceleration strategies to address the efficiency challenges of traditional MoE implementations. Our evaluation, using the…
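As a rough illustration of the idea in the abstract, the sketch below passes one input vector through a frozen base weight plus a gated mixture of LoRA adapters. The top-k softmax gate, the names `lora_moe_forward`, `matvec`, and `softmax`, and the plain-list representation are all assumptions chosen for readability. None of it is taken from the paper's implementation, which according to the abstract relies on dedicated forward acceleration strategies.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def lora_moe_forward(x, W, adapters, W_gate, top_k=2):
    """Hypothetical gated LoRA mixture for a single linear layer.

    x        : input vector of length d_in
    W        : frozen base weight, shape (d_out, d_in)
    adapters : list of (A, B) pairs, A is (r, d_in), B is (d_out, r)
    W_gate   : gating weight, shape (n_adapters, d_in)  [assumed form]
    top_k    : number of adapters mixed per input       [assumed value]
    """
    # Score every adapter for this input.
    logits = matvec(W_gate, x)
    # Keep the top_k highest-scoring adapters.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalise the selected scores so the mixing weights sum to 1.
    weights = softmax([logits[i] for i in top])

    # Base model output, then add each selected low-rank update B(Ax).
    y = matvec(W, x)
    for w, i in zip(weights, top):
        A, B = adapters[i]
        delta = matvec(B, matvec(A, x))
        y = [yi + w * di for yi, di in zip(y, delta)]
    return y, top, weights
```

Because the gate is evaluated per input, no explicit task label is passed in, which is the property the abstract describes as autonomous task selection. A real implementation would batch the adapter products rather than loop over them.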
Peer Reviews
Decision: ICLR 2025 Poster
1. MeteoRA is a general approach to incorporating domain-specific knowledge from multiple LoRAs in a single model.
2. The extensive evaluation demonstrates that MeteoRA performs similarly to the PEFT reference implementation, which provides a reasonable upper-bound reference.
3. The authors explain concerns about runtime and memory efficiency, then design, implement, and evaluate a CUDA kernel that addresses those concerns.
1. All LoRAs are stored in GPU memory, which limits the scalability of the approach. In contrast, S-LoRA (a LoRA serving system) scales to thousands of LoRA adapters by swapping LoRA weights to host memory. Proposing a target range for the number of LoRA adapters, or a method to swap adapters to host memory, could help address this concern.
2. The MeteoRA model is fine-tuned on a set of LoRAs and their target domains. Consequently, the approach does not efficiently integrate new LoRA adapters.
3. Capabilit
MeteoRA effectively implements a scalable integration of LoRA while adopting forward acceleration techniques during the inference phase, thereby enhancing the efficiency of the inference process.
- The paper is not very novel, given that using MoE for LoRA is an idea that has already been extensively explored [1,2,3]. It would be beneficial to clearly delineate how MeteoRA compares to and differs from the referenced LoRAMoE works.
- The term "reuse existing LoRA" is misleading and unclear; it implies the need for offline training and does not introduce any innovation compared to other MoE methods.
- While the bmm-torch method for parallel processing of LoRA adapters improves forward t
- The use of a full-mode MoE architecture to integrate multiple LoRA adapters is a novel contribution, potentially addressing limitations in existing methods like Hugging Face PEFT and S-LoRA.
- The proposed forward acceleration strategies address efficiency challenges in traditional MoE implementations, achieving significant speedups.
- It would be better to also compare with a model trained with MoE upcycling and to discuss the benefit of the proposed method.
- A more detailed analysis of the Triton operator is needed, including how it differs from methods like S-LoRA.
- The legend in Figure 3 is too small.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Mixture of Experts · Balanced Selection · Adapter
