FLoE: Fisher-Based Layer Selection for Efficient Sparse Adaptation of Low-Rank Experts
Xinyi Wang, Lirong Gao, Haobo Wang, Yiming Zhang, Junbo Zhao

TL;DR
FLoE combines Fisher-guided sparse layer selection with Bayesian rank optimization for efficient, task-specific adaptation of large language models, reducing redundant adapter parameters and improving resource efficiency.
Contribution
FLoE presents a novel importance scoring and automatic rank allocation method for sparse, efficient fine-tuning of LLMs, surpassing uniform adaptation approaches.
Findings
FLoE achieves better efficiency-accuracy trade-offs across multiple benchmarks.
It significantly reduces parameter redundancy compared to uniform PEFT methods.
FLoE adapts effectively in resource-constrained environments.
Abstract
Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a widely adopted strategy for adapting pre-trained Large Language Models (LLMs) to downstream tasks, significantly reducing memory and computational costs. However, most existing PEFT techniques uniformly deploy LoRA adapters across all layers, disregarding the intrinsic heterogeneity of layer contributions and task-specific rank requirements. This uniform paradigm leads to redundant parameter allocation and suboptimal adaptation efficiency. To address these limitations, we propose FLoE, a novel PEFT framework that introduces two key innovations: (i) a Fisher information-guided importance scoring mechanism to dynamically identify task-critical transformer layers for MoE-based low-rank adaptation, enabling sparse adapter deployment; and (ii) a Bayesian optimization-driven rank allocator that automatically determines optimal…
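The Fisher-guided layer selection described in the abstract can be illustrated with a minimal sketch. It assumes a diagonal empirical Fisher approximation (the mean squared gradient per parameter) as the layer score and a simple top-k cutoff; the abstract does not specify FLoE's exact scoring or selection rule, so both are stand-ins. The function names `fisher_layer_scores` and `select_top_k_layers` and the toy gradients are hypothetical and used only for illustration.

```python
from typing import Dict, List, Sequence


def fisher_layer_scores(
    grads_per_sample: Sequence[Dict[str, List[float]]],
) -> Dict[str, float]:
    """Score each layer by its mean squared gradient per parameter.

    This is the diagonal empirical Fisher approximation, averaged over
    samples and over the parameters of each layer (an assumption, not
    necessarily the scoring used in the paper).
    """
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for sample in grads_per_sample:
        for layer, grad in sample.items():
            totals[layer] = totals.get(layer, 0.0) + sum(g * g for g in grad)
            counts[layer] = counts.get(layer, 0) + len(grad)
    return {layer: totals[layer] / counts[layer] for layer in totals}


def select_top_k_layers(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring layer names, best first."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:k]


# Toy per-sample gradients for three layers (made-up numbers).
toy_grads = [
    {"layer0": [0.1, -0.1], "layer1": [1.0, 2.0], "layer2": [0.5, 0.5]},
    {"layer0": [0.0, 0.2], "layer1": [2.0, 1.0], "layer2": [0.5, -0.5]},
]

scores = fisher_layer_scores(toy_grads)
selected = select_top_k_layers(scores, k=2)
print(selected)
```

In this sketch, only the selected layers would receive adapters; in practice the gradients would come from backpropagating a loss through the model on a small set of task examples.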
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- Finding the layer importance and the optimal ranks for the adapters is a challenging problem and this paper provides a new viewpoint using Bayesian optimization and Fisher information to solve this problem.
- The experiments are quite comprehensive and show pretty good results.
- First, for the Bayesian optimization step, the paper omits the following work, which also uses Bayesian optimization to find the optimal ranks in LoRA (though not MoE + LoRA): [1] AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning
- For optimizing the mask:
  - The paper doesn’t seem to consider the problem that the mask requires the same number of parameters as the model (unless I’m missing something here). So what confuses me here is what makes this part mo
- The experimental setting of this paper is solid, and its performance surpasses several baselines.
- This paper provides an automated and theoretically-grounded solution for layer selection, addressing a practical challenge in PEFT.
- The method has limitations as it primarily focuses on the LoRA-MoE scenario. While this appears fancy, it is not the primary application context for standard LoRA.
- A comparison with relevant work is lacking. Many studies are optimizing LoRA from an efficiency perspective, with some focusing on rank reduction (e.g., AdaLoRA, which was included) and others on parameter sharing. The baselines compared in this paper could be more comprehensive to better situate the work. Some related works: ht
- FLoE tackles both redundancy and inefficiency in current PEFT approaches.
- The study provides extensive baseline comparisons and detailed latency/runtime studies.
- The integration of layer selection and adaptive rank tuning is nice, but each component has been explored on its own in prior studies.
- The method leans on a proxy dataset, but there is no ablation on the proxy.
- The layer-importance scoring hinges on Fisher information computed from pretraining loss sensitivity, implicitly assuming access to reliable pretraining gradients.
- **Principled Layer Selection**: FLoE’s utilization of Fisher information to guide which layers receive adapters is theoretically motivated and systematically implemented. This is a marked improvement over uniform or brute-force selection.
- **Dynamic Rank Allocation**: The inclusion of Bayesian optimization for LoRA rank and expert allocation addresses a significant practical bottleneck: the need for extensive hyperparameter tuning.
- **Comprehensive Experiments**: Evaluations span multiple
- Absence of “Ablation” for Fisher vs. Other Importance Measures: The methodology justifies Fisher information–based selection, but there is no systematic comparison against alternative (e.g., gradient norm, attention-based, or purely empirical) layer-importance metrics. The ablation in Table 7 focuses on the mask refinement step rather than the core Fisher metric. This omission prevents a full understanding of why FLoE’s particular mechanism is essential. Even simpler ablations methods like the
Taxonomy
Topics: Flow Measurement and Analysis · Speech and Audio Processing · Anomaly Detection Techniques and Applications
