MLAE: Masked LoRA Experts for Visual Parameter-Efficient Fine-Tuning
Junjie Wang, Guangjing Yang, Wentao Chen, Huahui Yi, Xiaohu Wu, Zhouchen Lin, Qicheng Lao

TL;DR
MLAE introduces a masking-based approach to enhance parameter-efficient fine-tuning of large models by increasing diversity and independence among low-rank matrices, leading to state-of-the-art results with fewer parameters.
Contribution
The paper proposes Masked LoRA Experts (MLAE), a novel method that decomposes low-rank matrices into independent experts and uses masking to improve diversity and performance.
Findings
Achieves 78.8% accuracy on the VTAB-1k benchmark.
Achieves 90.9% accuracy on the FGVC benchmark.
Surpasses previous SOTA by 0.8% with fewer parameters.
Abstract
In response to the challenges posed by the extensive parameter updates required for full fine-tuning of large-scale pre-trained models, parameter-efficient fine-tuning (PEFT) methods, exemplified by Low-Rank Adaptation (LoRA), have emerged. LoRA simplifies the fine-tuning process but may still struggle with a certain level of redundancy in low-rank matrices and limited effectiveness from merely increasing their rank. To address these issues, a natural idea is to enhance the independence and diversity of the learning process for the low-rank matrices. Therefore, we propose Masked LoRA Experts (MLAE), an innovative approach that applies the concept of masking to visual PEFT. Our method incorporates a cellular decomposition strategy that transforms a low-rank matrix into independent rank-1 submatrices, or "experts", thus enhancing independence. Additionally, we introduce a binary mask…
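The decomposition and masking idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `masked_lora_delta` and `sample_mask`, the matrix shapes, and the Bernoulli sampling of the mask are assumptions made for illustration, since the abstract is cut off before it specifies how the mask is chosen or applied. The sketch treats a rank-r update BA as a sum of r rank-1 "experts" and keeps only the experts whose mask entry is 1.

```python
import numpy as np

def masked_lora_delta(B, A, mask):
    """Sum the rank-1 experts b_i a_i^T whose mask entry is 1.

    B: (d_out, r), A: (r, d_in), mask: (r,) with entries 0 or 1.
    Hypothetical helper for illustration, not the paper's code.
    """
    experts = [np.outer(B[:, i], A[i, :]) for i in range(B.shape[1])]
    return sum(m * e for m, e in zip(mask, experts))

def sample_mask(rank, drop_prob, rng):
    """Assumed masking rule: drop each expert independently with prob drop_prob."""
    return (rng.random(rank) >= drop_prob).astype(np.float64)

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 5, 8
B = rng.normal(size=(d_out, rank))
A = rng.normal(size=(rank, d_in))

# With every expert kept, the sum of rank-1 experts equals the usual product BA.
full = masked_lora_delta(B, A, np.ones(rank))
assert np.allclose(full, B @ A)

# With a sampled mask, dropped experts contribute nothing to the update.
mask = sample_mask(rank, 0.5, rng)
masked = masked_lora_delta(B, A, mask)
assert np.allclose(masked, (B * mask) @ A)
```

Without a mask, the eight rank-1 experts sum to exactly the rank-8 product, so in this sketch any difference from plain LoRA comes only from the mask. Masking is equivalent to zeroing the corresponding columns of B (or rows of A).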
Peer Reviews
Decision: Submitted to ICLR 2025
1. This paper is clearly presented and well-organized. The authors also provide a detailed discussion of related works and variants.
2. MLAE compares multiple masking strategies, including fixed, random, and mixed masking, and presents detailed experimental results for reference.
3. MLAE shows commendable performance across multiple datasets, surpassing GLoRA on most datasets involved.
1. Two main components, the mix of LoRA experts and MoE dropout, have been discussed in previous studies [1,2], thereby limiting the novelty and technical contribution of MLAE.
2. Since MLAE shares some similarities with recent related competitive baselines, both the differences and the performance should be compared between MLAE and these works (e.g., IncreLoRA).
3. Given that this article proposes a task-independent LoRA improvement, it would be better to perform comparisons with existing LoRA-base
+ While the LoRA framework has been used within the mixture of experts (MoE) before, the use of masking at the expert level sounds novel. The authors demonstrate how the adaptive dropout technique combined with cellular decomposition effectively addresses redundancy.
+ The authors provide extensive experiments across 24 tasks covering both general vision tasks and fine-grained classification.
+ The paper is well-organized and easy to follow.
- Regarding Masking Configurations: While stochastic masking performed best, the analysis could be expanded by exploring additional dropout probabilities or scheduling strategies to better understand how different levels of sparsity affect each dataset type (e.g., specialized vs. structured).
- Computational Costs: The authors briefly mention longer training times for MLAE due to the introduction of dropout but do not fully explore how significant this increase is. Adding a thorough analysis of
1. The paper introduces a dropout mechanism specifically for LoRA adapters, applying rank-wise dropout to improve model generalizability.
2. The concept of treating each rank in the LoRA adapter as an independent expert is innovative, providing a fresh perspective on parameter-efficient fine-tuning.
3. The paper is well-structured and easy to follow, making the proposed methodology accessible and clear.
1. The claims regarding the use of cellular decomposition to impose independence and diversity constraints (lines 18-21, 161) lack clarity. Specifically, there is no explicit mention of how independence among the 8$\times$rank-1 LoRA matrices is achieved or how it is better than a rank-8 LoRA. Additionally, it’s not evident how using 8$\times$rank-1 LoRAs differs from a single rank-8 LoRA, as a rank-8 LoRA can be viewed as a concatenation of the row and column vectors of the 8 rank-1 LoRAs. This raises questi
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Neural Networks and Applications
Methods: Dropout
