FLEx: Personalized Federated Learning for Mixture-of-Experts LLMs via Expert Grafting
Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi

TL;DR
FLEx introduces a personalized federated learning framework for Mixture-of-Experts LLMs, enabling efficient personalization and knowledge preservation through expert grafting and selective parameter aggregation.
Contribution
The paper proposes FLEx, a novel federated learning approach that leverages expert grafting and selective aggregation to personalize MoE-based LLMs while preserving pretrained knowledge.
Findings
Outperforms federated baselines on diverse datasets
Reduces communication overhead by aggregating only shared parameters
Maintains strong knowledge retention on MMLU benchmark
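The second finding can be illustrated with a minimal sketch of aggregating only the shared (non-expert) parameters. It assumes a FedAvg-style weighted average and a hypothetical naming convention in which expert weights carry `.experts.` in their parameter name. Neither detail is confirmed by this summary, and `is_shared` and `aggregate_shared` are illustrative names rather than the authors' API.

```python
from typing import Dict, List

# Assumption: parameters whose name contains this tag belong to MoE experts.
EXPERT_TAG = ".experts."


def is_shared(name: str) -> bool:
    """Return True for parameters that are aggregated across clients."""
    return EXPERT_TAG not in name


def aggregate_shared(
    client_states: List[Dict[str, List[float]]],
    client_weights: List[float],
) -> Dict[str, List[float]]:
    """Weighted average over shared parameters only; experts are skipped."""
    total = float(sum(client_weights))
    aggregated: Dict[str, List[float]] = {}
    for name in client_states[0]:
        if not is_shared(name):
            continue  # expert parameters never leave the client
        avg = [0.0] * len(client_states[0][name])
        for state, weight in zip(client_states, client_weights):
            for i, value in enumerate(state[name]):
                avg[i] += (weight / total) * value
        aggregated[name] = avg
    return aggregated


# Toy client states, invented for illustration.
clients = [
    {"layer0.attn.q": [1.0, 2.0], "layer0.experts.3.w": [9.0, 9.0]},
    {"layer0.attn.q": [3.0, 6.0], "layer0.experts.3.w": [5.0, 5.0]},
]
update = aggregate_shared(clients, client_weights=[1, 3])
print(update)
```

Only the attention entry appears in `update`, so the payload a client sends scales with the shared parameters rather than with the full expert bank.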
Abstract
Federated instruction tuning of large language models (LLMs) is challenged by significant data heterogeneity across clients, demanding robust personalization. The Mixture of Experts (MoE) architecture, where experts can specialize in distinct data patterns, presents a natural architectural solution to this challenge. However, the inherent sparsity of the MoE architecture, achieved by selectively activating experts, poses a significant challenge to its integration with federated learning (FL). Conventional FL frameworks, designed for dense models, naively aggregate all expert parameters irrespective of their local activation patterns. This naive approach not only undermines MoE's dynamic sparsity but also risks corrupting the world knowledge within pretrained experts. To address this, we propose FLEx (Federated LLMs with Personalized Experts), a novel framework that leverages pretrained MoE-based…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper tackles an important and underexplored problem of enabling personalization for large MoE-based LLMs in federated settings.
2. The motivation (preserving pretrained knowledge and avoiding high communication cost) is sound and practically relevant.
3. The overall framework is well-written and supported with some empirical evaluation.
1. The term “expert grafting” is semantically misleading. In biology, grafting refers to physically attaching a part of one organism onto another to form an integrated system, such as the work FedGraft [R1]. In this work, however, the method only selects or prunes pretrained experts and attaches a small adapter for local personalization. This is conceptually closer to adapter-based personalization or lightweight expert extension, rather than genuine “grafting”. Using an inaccurate metaphor may c
S1. The paper identifies a critical and highly practical problem. The naive application of standard FL to massive MoE models is untenable due to communication overhead and knowledge corruption. The paper's core strategy--decoupling the aggregation of shared non-expert parameters from the personalization of frozen experts--is an elegant and effective solution to this problem.
S2. The "expert grafting" mechanism for personalization is a clever and efficient approach. Instead of training a new, ra
W1. The framework's core design--aggregating only non-expert parameters while freezing all pretrained experts--prevents the model from collaboratively learning new, shared knowledge within its most critical components. In Transformer architectures, the expert layers (FFNs) are the primary location for knowledge storage, whereas the aggregated attention layers mainly handle information routing. By freezing all experts, the FL process is blocked from updating the model's core "knowledge" stores wi
- **Clarity and Structure:** The paper is clearly written and well-organized. Figures and algorithmic descriptions are intuitive, enhancing the accessibility of the methodological exposition.
- **Reproducibility:** The authors have provided code and implementation details, which supports reproducibility and validation of the reported results.
- **Potential Representation–Routing Misalignment:** The design updates dense non-expert layers through global aggregation while keeping all pretrained experts and their routers frozen. This raises concerns about potential misalignment between the evolving feature representations and the static routing mechanism, which could lead to suboptimal or unstable expert activation, especially under non-IID data distributions. The paper lacks analysis—such as expert utilization statistics or routing entr
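The diagnostics this review asks for could be computed along the following lines. This is an illustrative sketch, not the authors' analysis: `expert_utilization` and `routing_entropy` are hypothetical helper names, and the inputs are toy top-1 routing assignments rather than real model outputs.

```python
import math
from collections import Counter
from typing import List


def expert_utilization(assignments: List[int], num_experts: int) -> List[float]:
    """Fraction of tokens routed to each expert (top-1 routing assumed)."""
    if not assignments:
        raise ValueError("need at least one routed token")
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(expert, 0) / total for expert in range(num_experts)]


def routing_entropy(utilization: List[float]) -> float:
    """Shannon entropy (in nats) of the expert utilization distribution."""
    return -sum(p * math.log(p) for p in utilization if p > 0)


# Toy assignments: one client spreads tokens evenly, another collapses
# onto a single expert.
balanced = expert_utilization([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
collapsed = expert_utilization([2] * 8, num_experts=4)

print(routing_entropy(balanced), routing_entropy(collapsed))
```

Entropy near `log(num_experts)` indicates evenly used experts, while entropy near zero indicates routing collapse. Tracking these values per client across rounds is one way to check whether a frozen router drifts out of step with the updated shared layers.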
1) Effective Solution to MoE-FL: Successfully utilizes MoE sparsity to reduce communication overhead and leverages the frozen experts to prevent catastrophic forgetting.
2) Strong Performance Gains: Demonstrates significant improvements over standard FL baselines, particularly under challenging pathological non-IID settings.
3) Knowledge Preservation: MMLU results strongly support the efficacy of freezing experts for retaining general world knowledge.
1) Greedy Selection Limitation: The expert grafting strategy relies on a greedy selection process that picks only a single expert for personalization. While effective, this approach does not explore the potential gains, or added complexity, of using multiple experts or more sophisticated selection mechanisms.
2) Lack of Communication Cost Measurement: Although reduced communication is presented as a key advantage, the paper does not provide concrete empirical evidence, such as tables or numerical c
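A communication measurement of the kind this review requests could start from a simple per-round byte count. The sketch below is an assumption-laden illustration: the parameter groups and their sizes are invented toy values (not figures from the paper), `upload_bytes` is a hypothetical helper, and 2 bytes per parameter assumes 16-bit weights.

```python
from typing import Callable, Dict


def upload_bytes(
    param_counts: Dict[str, int],
    include: Callable[[str], bool],
    bytes_per_param: int = 2,  # assumption: 16-bit weights
) -> int:
    """Bytes one client uploads per round for the included parameter groups."""
    return bytes_per_param * sum(
        count for name, count in param_counts.items() if include(name)
    )


# Toy parameter counts, invented for illustration only (not from the paper).
toy_model = {"attention": 1_000, "embeddings": 500, "experts": 8_000}

full_upload = upload_bytes(toy_model, lambda name: True)
shared_upload = upload_bytes(toy_model, lambda name: name != "experts")
savings = 1 - shared_upload / full_upload

print(f"full={full_upload} B, shared-only={shared_upload} B, saved={savings:.1%}")
```

Reporting such counts per round, alongside accuracy, for each baseline would give the side-by-side comparison the reviewer describes.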
Taxonomy
Topics: Recommender Systems and Techniques
