FLEx: Personalized Federated Learning for Mixture-of-Experts LLMs via Expert Grafting

Fan Liu; Bikang Pan; Zhongyi Wang; Xi Yao; Xiaoying Tang; Jingya Wang; Ye Shi

arXiv:2506.00965·cs.AI·October 8, 2025

Open Access·4 Reviews

TL;DR

FLEx introduces a personalized federated learning framework for Mixture-of-Experts LLMs, enabling efficient personalization and knowledge preservation through expert grafting and selective parameter aggregation.

Contribution

The paper proposes FLEx, a novel federated learning approach that leverages expert grafting and selective aggregation to personalize MoE-based LLMs while preserving pretrained knowledge.

Findings

01. Outperforms federated baselines on diverse datasets

02. Reduces communication overhead by aggregating only shared parameters

03. Maintains strong knowledge retention on MMLU benchmark
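
As an illustration of finding 02, the following is a minimal sketch of aggregating only the shared (non-expert) parameters while every expert weight stays on the client. It is not the authors' implementation: the weighted FedAvg-style average, the `.experts.` naming convention, and the helper names `is_shared` and `aggregate_shared` are all assumptions made for this example.

```python
from typing import Dict, List

# A toy "state dict": parameter name -> flat list of weights.
State = Dict[str, List[float]]


def is_shared(name: str) -> bool:
    # Assumption: expert weights are identifiable by ".experts." in their name.
    return ".experts." not in name


def aggregate_shared(client_states: List[State], client_sizes: List[int]) -> State:
    """Average only the shared parameters, weighted by local dataset size."""
    total = sum(client_sizes)
    aggregated: State = {}
    for name in client_states[0]:
        if not is_shared(name):
            continue  # expert parameters are never uploaded or averaged
        length = len(client_states[0][name])
        aggregated[name] = [
            sum(state[name][i] * size
                for state, size in zip(client_states, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Two hypothetical clients with one attention weight and one expert weight each.
client_a = {"layers.0.attn.q_proj": [1.0, 2.0], "layers.0.experts.3.w1": [9.0, 9.0]}
client_b = {"layers.0.attn.q_proj": [3.0, 4.0], "layers.0.experts.3.w1": [-9.0, -9.0]}

global_update = aggregate_shared([client_a, client_b], client_sizes=[1, 3])
print(global_update)  # {'layers.0.attn.q_proj': [2.5, 3.5]}
```

Because the expert entries never leave the client, the payload per round scales with the shared parameters only, which is the source of the communication saving this finding describes.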

Abstract

Federated instruction tuning of large language models (LLMs) is challenged by significant data heterogeneity across clients, demanding robust personalization. The Mixture of Experts (MoE) architecture, where experts can specialize in distinct data patterns, presents a natural architectural solution to this challenge. The inherent sparsity of the MoE architecture, achieved by selectively activating experts, poses a significant challenge to its integration with federated learning (FL). Conventional FL frameworks, designed for dense models, naively aggregate all expert parameters irrespective of their local activation patterns. This naive approach not only undermines MoE's dynamic sparsity but also risks corrupting the world knowledge within pretrained experts. To address this, we propose FLEx (Federated LLMs with Personalized Experts), a novel framework that leverages pretrained MoE-based…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 4

Strengths

1. The paper tackles an important and underexplored problem of enabling personalization for large MoE-based LLMs in federated settings.
2. The motivation (preserving pretrained knowledge and avoiding high communication cost) is sound and practically relevant.
3. The overall framework is well-written and supported with some empirical evaluation.

Weaknesses

1. The term “expert grafting” is semantically misleading. In biology, grafting refers to physically attaching a part of one organism onto another to form an integrated system, such as the work FedGraft [R1]. In this work, however, the method only selects or prunes pretrained experts and attaches a small adapter for local personalization. This is conceptually closer to adapter-based personalization or lightweight expert extension, rather than genuine “grafting”. Using an inaccurate metaphor may c

Reviewer 02·Rating 2·Confidence 4

Strengths

S1. The paper identifies a critical and highly practical problem. The naive application of standard FL to massive MoE models is untenable due to communication overhead and knowledge corruption. The paper's core strategy--decoupling the aggregation of shared non-expert parameters from the personalization of frozen experts--is an elegant and effective solution to this problem.
S2. The "expert grafting" mechanism for personalization is a clever and efficient approach. Instead of training a new, ra

Weaknesses

W1. The framework's core design--aggregating only non-expert parameters while freezing all pretrained experts--prevents the model from collaboratively learning new, shared knowledge within its most critical components. In Transformer architectures, the expert layers (FFNs) are the primary location for knowledge storage, whereas the aggregated attention layers mainly handle information routing. By freezing all experts, the FL process is blocked from updating the model's core "knowledge" stores wi

Reviewer 03·Rating 4·Confidence 4

Strengths

- **Clarity and Structure:** The paper is clearly written and well-organized. Figures and algorithmic descriptions are intuitive, enhancing the accessibility of the methodological exposition.
- **Reproducibility:** The authors have provided code and implementation details, which supports reproducibility and validation of the reported results.

Weaknesses

- **Potential Representation–Routing Misalignment:** The design updates dense non-expert layers through global aggregation while keeping all pretrained experts and their routers frozen. This raises concerns about potential misalignment between the evolving feature representations and the static routing mechanism, which could lead to suboptimal or unstable expert activation, especially under non-IID data distributions. The paper lacks analysis—such as expert utilization statistics or routing entr
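
The expert utilization statistics this review asks for could be gathered along the following lines. This is only a sketch under stated assumptions: the routing decisions are taken to be a flat list of top-1 expert indices per token, Shannon entropy is one possible summary chosen for this example, and the function names `expert_utilization` and `routing_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def expert_utilization(assignments: List[int], num_experts: int) -> List[float]:
    """Fraction of tokens routed to each expert (assumes top-1 routing)."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(expert, 0) / total for expert in range(num_experts)]


def routing_entropy(utilization: List[float]) -> float:
    """Shannon entropy (in nats) of the utilization distribution."""
    return -sum(p * math.log(p) for p in utilization if p > 0)


# Hypothetical routing traces for a layer with four experts.
balanced = expert_utilization([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
collapsed = expert_utilization([2] * 8, num_experts=4)

print(balanced, routing_entropy(balanced))    # uniform load, entropy = ln(4)
print(collapsed, routing_entropy(collapsed))  # one expert takes all tokens
```

Tracking these two quantities per client across rounds would show whether a frozen router keeps spreading tokens over experts or drifts toward a few of them as the shared layers change.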

Reviewer 04·Rating 6·Confidence 5

Strengths

1) Effective Solution to MoE-FL: Successfully utilizes MoE sparsity to reduce communication overhead and leverage the frozen experts to prevent catastrophic forgetting.
2) Strong Performance Gains: Demonstrates significant improvements over standard FL baselines, particularly under challenging pathological non-IID settings.
3) Knowledge Preservation: MMLU results strongly support the efficacy of freezing experts for retaining general world knowledge.

Weaknesses

1) Greedy Selection Limitation: The expert grafting strategy relies on a greedy selection process that picks only a single expert for personalization. While effective, this approach does not explore the potential gains (or added complexity) of using multiple experts or more sophisticated selection mechanisms.
2) Lack of Communication Cost Measurement: Although reduced communication is presented as a key advantage, the paper does not provide concrete empirical evidence, such as tables or numerical c

Taxonomy

Topics·Recommender Systems and Techniques