MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning
Andrea Manzoni

TL;DR
MoE-Sieve is a routing-guided approach that applies LoRA only to the most-routed experts in each layer of an MoE model, substantially reducing trainable parameters and training time while remaining competitive with full LoRA.
Contribution
The paper proposes MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning that adapts only the most-routed experts in each layer, improving efficiency while staying competitive with full LoRA.
Findings
Tuning only the top 25% most-routed experts per layer stays within +/-1 percentage point of full LoRA on average.
Trainable LoRA parameters and training time are reduced by over 70%.
The routing signal is crucial for effective expert selection.
Abstract
Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated ("cold"). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% routed experts per layer remains competitive with full LoRA, with mean differences within +/-1 percentage point across all conditions. This reduces LoRA trainable parameters by 70-73%, adapter…
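The three steps described in the abstract (profile routing counts, pick the top-k most-routed experts per layer, adapt only those) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `profile_routing_counts` and `select_hot_experts`, the input layout, and the toy routing data are all assumptions made for illustration. How routing decisions are captured from a real model, and how LoRA is then attached to the selected experts, depends on the framework and is left out.

```python
import numpy as np

def profile_routing_counts(router_choices, num_experts):
    """Count how often each expert is selected in each layer.

    router_choices: one integer array per layer, shape
    (num_tokens, experts_per_token), holding the expert indices the
    router picked on a calibration set (hypothetical input layout).
    Returns an array of shape (num_layers, num_experts).
    """
    return np.stack([
        np.bincount(np.asarray(layer).ravel(), minlength=num_experts)
        for layer in router_choices
    ])

def select_hot_experts(counts, fraction=0.25):
    """Return, per layer, the indices of the most-routed experts."""
    num_experts = counts.shape[1]
    k = max(1, int(round(fraction * num_experts)))
    # Sort experts by descending count; a stable sort breaks ties by index.
    order = np.argsort(-counts, axis=1, kind="stable")
    return [sorted(row[:k].tolist()) for row in order]

# Toy calibration data: 2 layers, 8 experts, 4 tokens, top-2 routing.
choices = [
    np.array([[0, 1], [0, 1], [0, 2], [1, 3]]),  # layer 0
    np.array([[7, 6], [7, 6], [7, 5], [6, 4]]),  # layer 1
]
counts = profile_routing_counts(choices, num_experts=8)
hot = select_hot_experts(counts, fraction=0.25)
print(hot)  # [[0, 1], [6, 7]]
```

In a fine-tuning setup, the per-layer index lists in `hot` would determine which expert modules receive LoRA adapters; all other experts would stay frozen without adapters.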
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
