Theory on Mixture-of-Experts in Continual Learning
Hongbo Li, Sen Lin, Lingjie Duan, Yingbin Liang, Ness B. Shroff

TL;DR
This paper provides the first theoretical analysis of Mixture-of-Experts in continual learning, demonstrating how MoE can mitigate catastrophic forgetting and improve learning performance through expert specialization and load balancing.
Contribution
It offers a novel theoretical framework for MoE in continual learning, analyzing expert specialization, gating dynamics, and convergence, with extensions to deep neural networks.
Findings
MoE diversifies experts to specialize in different tasks
Gating network updates should be terminated after sufficient training rounds
Adding more experts may delay convergence without performance gains
Abstract
Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of MoE and its impact on the learning performance in CL. This paper provides the first theoretical results to characterize the impact of MoE in CL via the lens of overparameterized linear regression tasks. We establish the benefit of MoE over a single expert by proving that the MoE model can diversify its experts to specialize in different tasks, while its router learns to select the right…
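The setting described in the abstract can be illustrated with a minimal sketch: a linear gate picks one expert per arriving task, and that expert moves to the closest point that fits the new data exactly. This is not the authors' code. The function names, the shape conventions, the summed-feature gate score, and the minimum-norm update rule are all assumptions chosen for illustration, and the gate is left untrained (the paper's gating update and auxiliary losses are not reproduced here).

```python
import numpy as np

def route_top1(theta, X):
    """Pick a single expert: argmax of linear gate scores (assumed gate form)."""
    # theta: (M, d) gate parameters, one row per expert (hypothetical layout).
    # X: (d, s) feature matrix of the arriving task, with s < d.
    scores = theta @ X.sum(axis=1)  # one scalar score per expert
    return int(np.argmax(scores))

def min_norm_update(w, X, y):
    """Move w the shortest Euclidean distance to a point with X.T @ w == y."""
    # Overparameterized case: s < d, so infinitely many w fit the data;
    # this picks the one closest to the current w (assumed update rule).
    residual = y - X.T @ w
    return w + X @ np.linalg.solve(X.T @ X, residual)

def run(num_rounds=50, M=4, N=2, d=20, s=5, seed=0):
    """Simulate tasks arriving over time; returns experts and routing history."""
    rng = np.random.default_rng(seed)
    truths = rng.normal(size=(N, d))             # ground-truth model per task
    experts = np.zeros((M, d))                   # expert weights start at zero
    theta = rng.normal(scale=0.01, size=(M, d))  # gate stays fixed in this sketch
    history = []
    for _ in range(num_rounds):
        n = int(rng.integers(N))                 # which task arrives this round
        X = rng.normal(size=(d, s))
        y = X.T @ truths[n]
        m = route_top1(theta, X)
        experts[m] = min_norm_update(experts[m], X, y)
        history.append((n, m))
    return experts, truths, history
```

Calling `run()` returns the expert weights, the ground-truth task models, and a list of `(task, expert)` pairs. Because the gate is never trained in this sketch, routing does not track task identity; it only shows where a learned gate would plug in.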
Peer Reviews
Decision: ICLR 2025 Spotlight
**Originality**:
- The idea that forgetting in MoE can be mitigated solely through specialized experts and correct routing is not entirely new, e.g. see [1].
- This paper is original in its theoretical contribution (to the best of my knowledge), providing proofs and bounds on CL metrics with MoEs.
- The proposed locality loss and load balancing loss provide clear mechanisms for task clustering and specialization (even though the load balancing loss is not a contribution of this work).

**Quality**: I
- The scale of the experiments is small, though one has to acknowledge that the contributions are mainly theoretical.
- I would appreciate more intuitive language and explanations.
- The current implementation is essentially at one extreme of the parameter-sharing trade-off, where no transfer happens between tasks?
- It is not exactly clear how these ideas can be extended to large-scale MoE with multiple expert layers and per-token routing, where correct routing as well as expert specialization is
1) Theoretical Foundations: The paper provides a comprehensive theoretical analysis of MoE in the context of continual learning, establishing clear benefits over single-expert models through explicit expressions for expected forgetting and generalization error. 2) Load Balancing: The model ensures balanced utilization of experts, which can lead to improved generalization performance as it reduces the risk of overloading any single expert with too many tasks. 3) Empirical Validation: Experiment
1) Validity of Proposition (4): The model gap term $\sum_{n \neq n'} \|w_n - w_{n'}\|^2$ only considers the Euclidean distance between weights. This may not fully capture the complex relationships between tasks. In practice, tasks could overlap in non-trivial ways (e.g., in feature space or output space), and simple weight differences do not reflect true "task divergence" accurately. 2) Limited Experiments: Though the main contribution is to present the theoretical
This work is written clearly and well structured. Overall, the theoretical analysis over MoEs is necessary in developing MoE-based large models, and this work discusses some insights into the catastrophic forgetting and generalization. The experiments include the overparameterized linear regression and MNIST cases.
(1) The scope of this work seems a bit wide according to the title. I suggest qualifying it with terms like “theoretical understanding”. (2) The theoretical analysis is mainly on overparameterized linear regression cases, which might be a limitation of this work, as nonlinear deep neural network cases can be more practical. (3) There is some related work on MoE theories, continual learning with MoEs, or MoEs for adaptation in the field that requires discussion [1-4]. Reference: [1] Nguyen H D,
Taxonomy
Topics: Problem and Project Based Learning
Methods: Softmax · Attention Is All You Need · Mixture of Experts · Linear Regression
