Convergence Guarantees for Gradient-Based Training of Neural PDE Solvers: From Linear to Nonlinear PDEs
Wei Zhao, Tao Luo

TL;DR
This paper develops a comprehensive convergence theory for gradient-based neural PDE solvers, covering linear and nonlinear cases, and reveals implicit regularization effects through theoretical analysis and numerical experiments.
Contribution
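It extends the neural tangent kernel (NTK) framework for PINNs to a broad class of linear PDE operators and, for nonlinear PDEs, proves convergence to critical points under the random feature model without strong over-parameterization, giving a unified theory that covers both PINNs and the Deep Ritz method.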
Findings
Global convergence guarantees for linear PDEs via NTK extension.
Convergence to critical points for nonlinear PDEs under the Łojasiewicz inequality.
Implicit regularization in the random feature model prevents parameters from diverging to infinity during training.
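For reference, one common textbook form of the Łojasiewicz gradient inequality is sketched below, together with the standard argument that turns it into convergence of a gradient flow. The symbols L, θ*, C, α, and U are notation assumed here; the paper's exact statement, exponent, and proof may differ.

```latex
% Assumed notation: L is a real-analytic loss, \theta^\ast a critical point of L.
% Lojasiewicz gradient inequality (one common form):
\exists\, C > 0,\ \alpha \in (0, \tfrac{1}{2}],\ \text{and a neighborhood } U \ni \theta^\ast
\ \text{such that}\quad
|L(\theta) - L(\theta^\ast)|^{1-\alpha} \;\le\; C\, \|\nabla L(\theta)\|
\quad \text{for all } \theta \in U .

% Along the gradient flow \dot\theta = -\nabla L(\theta), while \theta(t) \in U
% and L(\theta(t)) > L(\theta^\ast):
\frac{d}{dt}\bigl(L(\theta) - L(\theta^\ast)\bigr)^{\alpha}
  = -\alpha \bigl(L(\theta) - L(\theta^\ast)\bigr)^{\alpha - 1} \|\nabla L(\theta)\|^{2}
  \;\le\; -\frac{\alpha}{C}\, \|\dot\theta\| ,

% so the trajectory has finite length on that time interval:
\int_{0}^{T} \|\dot\theta(t)\|\, dt \;\le\; \frac{C}{\alpha}\,
  \bigl(L(\theta(0)) - L(\theta^\ast)\bigr)^{\alpha} .
```

Finite trajectory length is what upgrades "the gradient tends to zero" to "the parameters converge to a single critical point", which is the kind of statement the second finding refers to.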
Abstract
We present a unified convergence theory for gradient-based training of neural network methods for partial differential equations (PDEs), covering both physics-informed neural networks (PINNs) and the Deep Ritz method. For linear PDEs, we extend the neural tangent kernel (NTK) framework for PINNs to establish global convergence guarantees for a broad class of linear operators. For nonlinear PDEs, we prove convergence to critical points via the Łojasiewicz inequality under the random feature model, eliminating the need for strong over-parameterization and encompassing both gradient flow and implicit gradient descent dynamics. Our results further reveal that the random feature model exhibits an implicit regularization effect, preventing parameter divergence to infinity. Theoretical findings are corroborated by numerical experiments, providing new insights into the training dynamics and…
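As a rough illustration of the fixed-kernel picture behind convergence results of this kind, the following minimal sketch trains only the outer weights of a random feature model on a PINN-style least-squares loss. Everything specific in it is an assumption made for illustration and is not taken from the paper: the 1-D Poisson problem -u'' = π² sin(πx) with zero boundary values, the tanh features, the sampling scales, and the hypothetical helper names `build_system`, `train`, and `predicted_residual`. In this simplified linear setting the residual obeys r_{k+1} = (I - ηG) r_k with Gram matrix G = A Aᵀ, so each eigen-direction of G decays at its own rate, and a component along a zero eigenvalue does not decay at all.

```python
import numpy as np


def build_system(n_int=10, m=200, seed=0):
    """Collocation system A a = y for -u'' = f on (0, 1), u(0) = u(1) = 0.

    Assumed model: u(x) = sum_k a_k * tanh(w_k x + b_k) / sqrt(m), with the
    inner weights (w, b) drawn once and frozen (random feature model).
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 4.0, size=m)  # frozen inner weights (assumed scale)
    b = rng.normal(0.0, 1.0, size=m)  # frozen inner biases (assumed scale)
    x_int = np.linspace(0.0, 1.0, n_int + 2)[1:-1]  # interior points
    x_bdy = np.array([0.0, 1.0])                    # boundary points

    t = np.tanh(np.outer(x_int, w) + b)
    # d^2/dx^2 tanh(w x + b) = -2 w^2 t (1 - t^2), so -phi'' = 2 w^2 t (1 - t^2).
    A_int = 2.0 * w**2 * t * (1.0 - t**2)
    A_bdy = np.tanh(np.outer(x_bdy, w) + b)
    A = np.vstack([A_int, A_bdy]) / np.sqrt(m)

    # Right-hand side: f(x) = pi^2 sin(pi x) inside, zero on the boundary.
    y = np.concatenate([np.pi**2 * np.sin(np.pi * x_int), np.zeros(2)])
    return A, y


def train(A, y, steps=500):
    """Gradient descent on 0.5 * ||A a - y||^2 over the outer weights a."""
    G = A @ A.T                               # Gram matrix of the fixed features
    eta = 1.0 / np.linalg.eigvalsh(G)[-1]     # step size 1 / lambda_max(G)
    a = np.zeros(A.shape[1])
    losses = []
    for _ in range(steps):
        r = A @ a - y
        losses.append(0.5 * r @ r)
        a -= eta * (A.T @ r)
    r = A @ a - y
    losses.append(0.5 * r @ r)
    return a, r, np.array(losses), G, eta


def predicted_residual(G, eta, r0, steps):
    """Closed form r_k = (I - eta G)^k r_0 via the eigen-decomposition of G."""
    lam, V = np.linalg.eigh(G)
    return V @ ((1.0 - eta * lam) ** steps * (V.T @ r0))


if __name__ == "__main__":
    A, y = build_system()
    a, r, losses, G, eta = train(A, y, steps=500)
    lam = np.linalg.eigvalsh(G)
    print("smallest / largest Gram eigenvalue:", lam[0], lam[-1])
    print("initial loss:", losses[0], "final loss:", losses[-1])
    gap = np.linalg.norm(r - predicted_residual(G, eta, -y, 500))
    print("gap between iterated and predicted residual:", gap)
```

The printed smallest eigenvalue is the quantity to watch: when it is close to zero, the loss still decreases monotonically with this step size, but the slowest residual component barely moves, which is why positivity of the smallest Gram eigenvalue matters for a global rate.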
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper is overall well written and easy to follow, though a few key notations could be introduced earlier for clarity.
- The work approaches the study of training dynamics and convergence from a rigorous theoretical perspective.
- Convergence of two-layer MLPs under the NTK regime is already well established. Theorem 1 merely applies this known result under an additional regularization on the differential operator. Moreover, the theorem could fail if the smallest eigenvalues are zero; the authors do not discuss this possibility, and the Gram matrices are not formally defined before or within the theorem statement, only in the next subsection.
- Sections 4.1–4.3 train only the second layer while freezing the first-layer
1. The novel application of the Łojasiewicz inequality to analyze the convergence of the non-convex loss functions encountered when solving nonlinear PDEs with neural networks.
2. The article provides a technically solid extension of the NTK theory for linear PDEs.
3. This theoretical article presents mathematically rigorous convergence results that span a wide class of PDEs.
1. The core weakness is proving convergence for nonlinear problems only under the RFM, where only the last layer weights are trained.
2. The empirical validation of the derived theory is very limited, focusing only on a single 1-D nonlinear PDE and lacking higher-dimensional or real-world benchmarks.
3. For nonlinear PDEs, the proven convergence is only to a critical point of the loss function. A critical point is merely a state where the gradient is zero, which could be a local minimum.
4. The
(1) The manuscript is well written and well motivated by the missing convergence properties of PINNs for nonlinear PDEs. In particular, the PDE setting considered in the paper is concise and on point.
(2) The manuscript addresses concrete gaps in the convergence theory and error bounds for PINNs and the Deep Ritz method.
(1) The paper relies heavily on random-feature weight initialization, but does not provide a thorough literature review of the random features one can use.
- Fourier random feature approaches should be cited for data-agnostic cases, e.g. (1.1) Li, Zhu, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic. 2021. “Towards a Unified Analysis of Random Fourier Features.” Journal of Machine Learning Research 22 (108): 1–51.
- Data-driven approaches for random features are also missing; e.g. (1.2
Taxonomy
Topics: Model Reduction and Neural Networks · Machine Learning in Materials Science · Quantum many-body systems
