Ensembles of Low-Rank Expert Adapters
Yinghao Li, Vianne Gao, Chao Zhang, MohamadAli Torkamani

TL;DR
The paper introduces ELREA, an ensemble framework of low-rank expert adapters that clusters training instructions by gradient similarity, enabling specialized adapters to improve large language model performance on diverse tasks.
Contribution
ELREA clusters training data by gradient directions and trains expert adapters on these clusters, enhancing task specialization and model performance with efficient low-rank adaptation.
Findings
ELREA outperforms baseline LoRA adapters on various domain-specific tasks.
ELREA reduces training conflicts by clustering instructions based on gradient similarity.
ELREA maintains training and inference efficiency comparable to existing methods.
Abstract
The training and fine-tuning of large language models (LLMs) often involve diverse textual data from multiple sources, which poses challenges due to conflicting gradient directions, hindering optimization and specialization. These challenges can undermine model generalization across tasks, resulting in reduced downstream performance. Recent research suggests that fine-tuning LLMs on carefully selected, task-specific subsets of data can match or even surpass the performance of using the entire dataset. Building on these insights, we propose the Ensembles of Low-Rank Expert Adapters (ELREA) framework to improve the model's capability to handle diverse tasks. ELREA clusters the training instructions based on their gradient directions, representing different areas of expertise and thereby reducing conflicts during optimization. Expert adapters are then trained on these clusters, utilizing…
Peer Reviews
Decision·ICLR 2025 Poster
1. The presentation is well. 2. The method section of the paper is clearly articulated, especially Figure 1, which is a clear and self-explanatory flowchart. 3. Major differences of MoE and Deep Ensembles are highlighted (P.4). Assumptions (P.5) and limitations (P.10) are clearly stated. 4. Included detailed appendices to further explain the datasets, experiment setups so the main body is not bulky.
Missing Paper Structure. The section “Related Work” should be better put in the Appendix. Limit of Novelty. This paper proposes combine two points together creatively. However, the method for data selection and partitioning (Section 3.2) directly depends on [1]. Moreover, the per-cluster fine-tuning actually has little to do with LoRA; for instance, we could also use full-parameter or soft-prompt tuning for each cluster. Therefore, the motivation for exclusive “LoRA” is not adequate, which need
1. The paper is well-written and easy to understand. 2. The combination of LoRA with gradient-based clustering for expert adapters to address the conflicting gradient issue is interesting. 3. The framework’s design is well-structured, and the authors provide extensive experimental validation to demonstrate the efficacy of ELREA. Since I am not very familiar with the latest work in the field of LLM fine-tuning, I may not be in the best position to assess the novelty of this paper. This paper is
1. As discussed by the authors in the limitations section, the computational and memory overhead of ELREA is substantial. 2. The paper lacks an in-depth discussion on the robustness of the clusters and the impact of the clustering algorithm (BIRCH) on performance. Since clustering is central to ELREA, variations in cluster formation could significantly affect adapter performance.
1. The authors explained the challenges of fine-tuning large models from the perspective of gradient conflicts, which is a novel approach. They then propose clustering training samples based on their gradients and training them separately using LoRA. By integrating multiple LoRA adapters, they enhance the performance of LoRA fine-tuning. 2. The authors’ writing structure is well-organized and the text is clear and easy to understand. Additionally, the research methodology is rigorous, with deta
1. The authors did not discuss the complexity of the method. Compared to the vanilla LoRA, this approach introduces a clustering process and the training of multiple LoRAs, which inevitably increases the complexity of training. Additionally, computing different LoRA weight combinations for each sample could also lead to a decline in inference performance. Although the author's method does indeed improve accuracy, whether it is practical in real-world scenarios requires further discussion. 2. Th
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsComputational Physics and Python Applications
