Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting
Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen

TL;DR
Ada-MoGE is an adaptive Gaussian Mixture of Experts model for time series forecasting that dynamically adjusts the number of experts based on spectral analysis, improving accuracy and efficiency.
Contribution
The paper introduces Ada-MoGE, which adaptively determines the number of experts using spectral information, addressing frequency shift issues in time series forecasting.
Findings
Achieves state-of-the-art performance on six benchmarks.
Achieves high accuracy with only 0.2 million parameters.
Effectively handles frequency shifts with adaptive expert allocation.
Abstract
Multivariate time series forecasting is widely used in areas such as industry, transportation, and finance. However, the dominant frequencies in a time series may shift as the spectral distribution of the data evolves. Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in a frequency coverage imbalance issue. Specifically, too few experts can cause critical information to be overlooked, while too many can introduce noise. To this end, we propose Ada-MoGE, an adaptive Gaussian Mixture of Experts model. Ada-MoGE integrates spectral intensity and frequency response to adaptively determine the number of experts, ensuring alignment with the input data's frequency distribution. This approach prevents both information loss due to an insufficient number of experts and noise contamination from an excess of…
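The idea in the abstract, Gaussian frequency-domain experts whose count follows the input spectrum, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the filter centers and widths are fixed here rather than learned, and the energy-coverage rule in `choose_num_experts` is an assumed stand-in for the paper's learned selection of the number of experts. All function names and the `coverage` parameter are hypothetical.

```python
import numpy as np


def gaussian_filter_bank(freqs, centers, widths):
    """Gaussian frequency responses, shape (n_experts, n_freqs)."""
    diff = freqs[None, :] - centers[:, None]
    return np.exp(-0.5 * (diff / widths[:, None]) ** 2)


def choose_num_experts(energy, coverage=0.9):
    """Smallest K whose top-K experts hold `coverage` of the total energy.

    Assumption: a simple heuristic, not the mechanism used in the paper.
    """
    ranked = np.sort(energy)[::-1]
    cumulative = np.cumsum(ranked) / ranked.sum()
    k = int(np.searchsorted(cumulative, coverage)) + 1
    return min(k, len(energy))


def adaptive_gaussian_mixture(x, centers, widths, coverage=0.9):
    """Split a 1-D series into K band-limited components, with K data-dependent."""
    n = x.shape[0]
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n)  # cycles per sample, in [0, 0.5]
    bank = gaussian_filter_bank(freqs, centers, widths)
    # Response-weighted spectral power captured by each Gaussian expert.
    energy = bank @ (np.abs(spectrum) ** 2)
    k = choose_num_experts(energy, coverage)
    top = np.argsort(energy)[::-1][:k]
    # Soft band-pass filtering: multiply the spectrum by each selected response.
    components = np.fft.irfft(bank[top] * spectrum[None, :], n=n)
    return k, top, components


t = np.arange(256)
centers = np.array([0.0625, 0.1875, 0.3125, 0.4375])
widths = np.full(4, 0.03)
one_tone = np.sin(2 * np.pi * 0.0625 * t)
two_tones = one_tone + np.sin(2 * np.pi * 0.3125 * t)
k_one, _, _ = adaptive_gaussian_mixture(one_tone, centers, widths)
k_two, _, _ = adaptive_gaussian_mixture(two_tones, centers, widths)
```

In this toy setup a single sinusoid activates one filter, and adding a second tone in a different band raises the selected count to two, which is the kind of spectrum-dependent expert allocation the abstract describes.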
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper is motivated by the observation that existing mixture-of-experts models may suffer from a frequency imbalance issue. While this is an interesting and relevant motivation, the proposed method does not appear to address the problem in a clear or rigorous manner.
2. The proposed approach seems to be compatible with existing methods, which allows for evaluation in terms of the incremental benefits it provides over the chosen base model.
1. Claims lacking proper justification: In the abstract and introduction, the authors discuss a frequency shift phenomenon in time series data and claim that “Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in a frequency coverage imbalance issue.” However, the connection between this phenomenon and the limitations of MoE models is not clearly established. A more detailed explanation, ideally supported by theoretic
1. The paper introduces an intuitive adaptive frequency-domain MoE framework that dynamically selects experts based on spectral characteristics.
2. The use of learnable Gaussian filters provides smooth and flexible frequency decomposition beyond hard band partitioning.
3. The paper is clearly written with a coherent motivation and strong empirical results.
1. The motivation for operating in the frequency domain remains insufficiently justified. While the paper argues that dominant frequencies vary, it does not clearly explain why frequency-domain routing is fundamentally preferable to potential time-domain or variable-domain alternatives (e.g., learnable filters, SSMs, or channel-wise gating).
2. The experimental comparisons do not fully place the method within the broader literature on frequency-aware prediction. Other representative frequency-b
* The paper is well written and easy to follow.
* The proposed idea of adaptive selection is novel and well suited for time series applications.
* It is beneficial that the approach can be used in combination with existing forecasting models.
* The experiments include many ablations that show the contributions of the individual components of the proposed approach and the sensitivity of the results to factors such as the number of experts or the number of features.
* In the state-of-the-art analysis, the authors focus only on machine learning and, within that, on neural networks. Other, more classical approaches are completely ignored.
* Although novel, the proposed contribution is rather incremental. The performance gains from the proposed MoE add-on are reported but rather limited. I consider it too little for a major ML conference like ICLR.
* There should be one paragraph in the paper that describes in more detail, how the proposed approach is combined
- The paper addresses a valid and practical problem in time series MoE models, which is the mismatch between a fixed number of experts and the shifting spectral distributions of real-world data.
- The core idea of adaptively selecting the number of experts ($K$) based on data-specific spectral properties is novel and sensible. Using both frequency dominance ($\mu(f)$) and variable activity ($E(v)$) to inform this selection is a good design choice.
- The model reports very strong performance on m
- The central mechanism for adaptive expert selection is not explained well. The paper says an MLP outputs the number $K$, which guides a Top-K selection. It is completely unclear how this integer $K$ is trained in an end-to-end differentiable way.
- The paper is very confusing about the model architecture. Table 1 shows Ada-MoGE as a module, but Table 2 lists "Ada-MoGE (Ours)" as a separate model. The architecture of this standalone model from Table 2 is not described, which makes the main resu
Taxonomy
Topics: Forecasting Techniques and Applications · Stock Market Forecasting Methods · Traffic Prediction and Management Techniques
