Dynamic Mixture-of-Experts for Incremental Graph Learning

Lecheng Kong; Theodore Vasiloudis; Seongjun Yun; Han Xie; Xiang Song

arXiv:2508.09974·cs.LG·August 14, 2025

TL;DR

This paper introduces DyMoE, a dynamic mixture-of-experts approach for incremental graph learning that mitigates catastrophic forgetting and improves accuracy by selectively activating relevant experts, with reduced computational costs.

Contribution

The paper proposes a novel DyMoE GNN layer with specialized experts and a regularization loss, plus a sparse MoE technique to efficiently handle growing data over time.
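As a rough illustration of the idea summarized above, the sketch below grows a mixture by one expert per incoming data block and scores a gate with a block-guided penalty. It is a toy under stated assumptions, not the paper's layer: the class `TinyDynamicMoE`, the linear experts, the dot-product gate, and the cross-entropy form of `block_guided_loss` are all hypothetical choices made for this example, and graph message passing is omitted entirely.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyDynamicMoE:
    """Toy dense mixture with one linear expert per data block (hypothetical)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.experts = []    # one dim x dim weight matrix per data block
        self.gate_keys = []  # one gating vector per expert

    def add_expert(self):
        # Called when a new data block arrives; existing experts are untouched.
        w = [[self.rng.gauss(0, 0.1) for _ in range(self.dim)]
             for _ in range(self.dim)]
        self.experts.append(w)
        self.gate_keys.append([self.rng.gauss(0, 0.1) for _ in range(self.dim)])

    def gate(self, x):
        # Assumed gate: dot product between the input and each expert's key.
        scores = [sum(k * v for k, v in zip(key, x)) for key in self.gate_keys]
        return softmax(scores)

    def forward(self, x):
        probs = self.gate(x)
        out = [0.0] * self.dim
        for p, w in zip(probs, self.experts):
            for r in range(self.dim):
                out[r] += p * sum(w[r][c] * x[c] for c in range(self.dim))
        return out, probs


def block_guided_loss(probs, block_id):
    # Assumed form: cross-entropy that pushes the gate toward the expert
    # created for the data block the input came from.
    return -math.log(probs[block_id])


layer = TinyDynamicMoE(dim=3)
layer.add_expert()  # expert for data block 0
layer.add_expert()  # expert for data block 1
out, probs = layer.forward([1.0, 0.5, -0.2])
loss = block_guided_loss(probs, block_id=1)
```

In this toy, adding an expert never modifies earlier experts, which is one simple way to picture how older knowledge can be preserved while a new block is learned.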

Findings

1. Achieves a 4.92% relative accuracy increase over baselines.

2. Effectively mitigates catastrophic forgetting in incremental graph learning.

3. Reduces computation time using sparse MoE with top-k experts.
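To make the top-k point concrete, here is a minimal sketch of sparse routing in general: only the k highest-scoring experts are evaluated, and their gate weights are renormalized. The function names, the scalar experts, and the scores are hypothetical and chosen only for illustration; the paper's actual gating may differ.

```python
import math


def top_k_gate(scores, k):
    """Keep the k highest-scoring experts and softmax over just those."""
    k = min(k, len(scores))
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sparse_mixture(x, experts, scores, k):
    """Evaluate only the selected experts and combine their outputs."""
    weights = top_k_gate(scores, k)
    out = 0.0
    for i, w in weights.items():
        out += w * experts[i](x)  # the other experts are never called
    return out, sorted(weights)


# Four hypothetical scalar experts and hand-picked gate scores.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
scores = [0.1, 2.0, -1.0, 2.0]
y, active = sparse_mixture(3.0, experts, scores, k=2)
```

With k fixed, the per-input cost stays bounded by k expert evaluations even as the number of experts grows with new data blocks.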

Abstract

Graph incremental learning is a learning paradigm that aims to adapt trained models to continuously incremented graphs and data over time without the need for retraining on the full dataset. However, regular graph machine learning methods suffer from catastrophic forgetting when applied to incremental learning settings, where previously learned knowledge is overridden by new knowledge. Previous approaches have tried to address this by treating the previously trained model as an inseparable unit and using techniques to maintain old behaviors while learning new knowledge. These approaches, however, do not account for the fact that previously acquired knowledge at different timestamps contributes differently to learning new tasks. Some prior patterns can be transferred to help learn new data, while others may deviate from the new data distribution and be detrimental. To address this, we…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. This paper introduces a Dynamic Mixture-of-Experts (DyMoE) module with separate experts for each data block, allowing dynamic relevance-based information synthesis.
2. This paper proposes a block-guided loss function to minimize negative interference among experts, reducing catastrophic forgetting.
3. This paper integrates the DyMoE module into GNN layers to effectively handle data shifts in continual graph learning.
4. This paper develops a sparse DyMoE variant that focuses on the most releva

Weaknesses

1. In real-world applications, how do the dynamic changes in graph structures and data blocks affect the model's performance? Have you considered the impact of data noise and outliers on the results?
2. Experiments:
   1) The MoE structure increases the number of parameters in the model, thereby enhancing its capability. In contrast, the parameter count of the baseline model is not specifically mentioned. Does this represent an unfair comparison?
   2) This paper mentions "with minimal computation inc

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The paper is easy to follow. The idea of using the MoE model to address graph incremental learning is novel and interesting. Experimental results on six graph incremental learning datasets demonstrate the effectiveness of the proposed DyMoE model and the block-guided regularisation loss. The results also indicate the DyMoE model can learn dedicated experts for different data blocks.

Weaknesses

1. Theorem 1 is established under the assumption that the data follow a Gaussian mixture distribution, and this assumption should be explicitly stated in the theorem to make it more precise. Can this theorem extend to data following distributions other than the Gaussian mixture distribution?
2. The details of the data balancing training procedure (lines 297-301) are not very clear from the paper. Specifically, how to select the memory set for the new data block? Does this training procedure use b

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The paper identified the issue of existing continual learning methods that ignore the correlation between different data blocks.
2. The paper tackled a significant dynamic graph problem in real-world scenarios, where data arrives incrementally, offering a scalable solution without the need for full dataset retraining.
3. The paper developed a DyMoE module with specialized experts for each data block and introduced a data block-guided loss to reduce negative interference among the experts.

Weaknesses

1. The use of symbols is inconsistent, and the explanations lack clarity. Specifically:
   (a) What do m and n represent in Eq. (3)? Are they referring to the feature dimension or the number of nodes?
   (b) In Eq. (13), what is the specific meaning of y? Does it convey the same meaning as h?
   (c) In Figure 1, triangles are used on the left and circles on the bottom right to represent blocks. The authors can include a notation table or provide more explicit definitions for each symbol whe
