NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
Dengdi Sun, Xiaoya Zhou, Xiao Wang, Hao Si, Wanli Lyu, Jin Tang, Bin Luo

TL;DR
This paper introduces NESTOR, a neural operator leveraging a nested Mixture-of-Experts framework to improve large-scale PDE pre-training by capturing both global and local dependencies, leading to better generalization.
Contribution
It proposes a novel nested MoE neural operator architecture that enhances PDE modeling capacity and transferability compared to traditional single-architecture neural operators.
Findings
Effective large-scale pre-training on twelve PDE datasets.
Improved transferability to downstream PDE tasks.
Demonstrated superior performance over existing methods.
Abstract
Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby…
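The nested routing described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes mean-pooled tokens as the image-level routing signal, single linear layers with tanh as experts, and softmax-renormalised top-k gating. The names NestedMoE, TokenMoE, and top_k_gate are hypothetical. The AFNO shared expert and FlashAttention mentioned in the reviews are not modelled.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_gate(logits, k):
    """Pick the k largest logits along the last axis; softmax over just those."""
    idx = np.argsort(logits, axis=-1)[..., ::-1][..., :k]
    top = np.take_along_axis(logits, idx, axis=-1)
    return idx, softmax(top, axis=-1)

class TokenMoE:
    """Token-level Sub-MoE: every token is routed to its own top-k experts."""
    def __init__(self, dim, n_experts, k, rng):
        self.router = rng.normal(scale=0.1, size=(dim, n_experts))
        self.experts = rng.normal(scale=0.1, size=(n_experts, dim, dim))
        self.k = k

    def __call__(self, tokens):                      # tokens: (n_tokens, dim)
        idx, w = top_k_gate(tokens @ self.router, self.k)
        out = np.zeros_like(tokens)
        for slot in range(self.k):
            chosen = self.experts[idx[:, slot]]      # (n_tokens, dim, dim)
            y = np.einsum("td,tde->te", tokens, chosen)
            out += w[:, slot:slot + 1] * np.tanh(y)
        return out

class NestedMoE:
    """Image-level MoE whose experts are themselves token-level Sub-MoEs."""
    def __init__(self, dim, n_image_experts=3, n_token_experts=4,
                 k_image=1, k_token=2, seed=0):
        rng = np.random.default_rng(seed)
        self.image_router = rng.normal(scale=0.1, size=(dim, n_image_experts))
        self.image_experts = [TokenMoE(dim, n_token_experts, k_token, rng)
                              for _ in range(n_image_experts)]
        self.k_image = k_image

    def __call__(self, tokens):
        # Assumption: the global routing signal is the mean over all tokens.
        summary = tokens.mean(axis=0)
        idx, w = top_k_gate(summary @ self.image_router, self.k_image)
        out = np.zeros_like(tokens)
        for i, wi in zip(idx, w):
            out += wi * self.image_experts[i](tokens)
        return out, idx

tokens = np.random.default_rng(1).normal(size=(16, 8))  # 16 patch tokens, dim 8
output, image_choice = NestedMoE(dim=8)(tokens)
```

Only the selected image-level expert runs, and inside it each token activates only its own top-k sub-experts, which is how a design of this kind would keep per-input compute below that of a dense model with the same parameter count.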
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The idea is novel. It uses a two-level MoE for PDE operators, which is a new architectural twist for scientific operators. In addition, using AFNO as a shared global expert plus FlashAttention inside an MoE operator is a thoughtful combination. 2. The paper is clear and well organized, and the figures are good and easy to read. 3. The overall results are positive and demonstrate the effectiveness of the method.
1. Computational cost is not quantified. MoE + FlashAttention + AFNO is likely compute- and memory-heavy. The paper lacks FLOPs/throughput, GPU hours, tokens/updates, or cost-vs-quality curves, and there is no comparison to strong single-backbone operators at equal compute. 2. Most experiments appear to be on regular-grid 2D fields with autoregressive next-frame targets. Little is shown for 3D, irregular meshes, or varying boundary/geometry conditions.
- Introduction of an innovative Mixture-of-Experts architecture to deal with image-level and token-level PDE solution complexity. - Use of a pre-training procedure on a vast array of PDEs and parameters, with fine-tuning on particular instances leading to increased performance. - State-of-the-art results on multiple PDE benchmarks. - Ablation studies justify the need for the various components of the framework.
- The model is evaluated only on next-step prediction accuracy, without consideration for rollout stability over longer horizons. In contrast, DPOT incorporated noise during training to enhance robustness. Without similar analysis, it remains unclear whether this model maintains stability during multi-step predictions. - The model lacks interpretability regarding expert behavior. It is not demonstrated whether the individual experts specialize in different physical regimes or scales, leaving unc
- This work presents one of the first successfully trained Mixture-of-Experts (MoE) architectures in the context of foundation models for PDEs, marking a valuable step toward scalable, modular operator learning. - Despite the complexity of the nested MoE design (image-level and token-level routing), the authors demonstrate that the model can be trained stably and efficiently across a large and diverse collection of PDE datasets. - The paper shows attention to optimization stability, incorporat
- The paper does not reference or compare against current state-of-the-art foundation models for PDEs, such as Poseidon [1], which already explore large-scale pretraining and transfer to unseen downstream tasks. - In PDE settings, the concept of a temporal “frame” (Eq. 2) is ambiguous. A specific time step Δt must be fixed to define the “next frame,” which makes the model resolution-dependent. Once the temporal resolution is changed or sub-sampled, the model becomes inapplicable. Moreover, diff
- The paper is easy to follow overall, and the idea is easy to understand. - The model architecture design is well motivated overall. The MoE design is a reasonable approach for pre-training on large-scale heterogeneous PDE datasets. - The experimental results demonstrate the architecture's effectiveness over existing methods. The fine-tuning experiments are also interesting and show the model's transfer learning capability.
- It seems that this paper was finished in a hurry without careful proofreading. Figure 1 reads "PDF's Diversity" and "PDF's Complexity", which I think should be "PDE". The citation formatting is also irregular throughout the paper. - Though carefully designed, the technical contributions regarding the model architecture and the task loss are somewhat limited. - It seems that the performance gain over the baseline is relatively modest; 6 of 14 metrics of your model do not exceed the baseline. As
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
