Neural Policy Iteration for Stochastic Optimal Control: A Physics-Informed Approach

Yeongjong Kim; Yeoneung Kim; Minseok Kim; Namkyeong Cho

arXiv:2508.01718·cs.LG·August 5, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a physics-informed neural network framework for solving stochastic optimal control problems, providing error control, interpretability, and convergence guarantees, demonstrated on various benchmark problems including high-dimensional LQR.

Contribution

It extends PINN-based methods to stochastic control, offering a systematic policy iteration approach with theoretical error bounds and convergence guarantees.

Findings

- Effective on stochastic cartpole and pendulum problems
- Achieves high accuracy in up to 10D LQR problems
- Provides explicit bounds on policy evaluation errors

Abstract

We propose a physics-informed neural network policy iteration (PINN-PI) framework for solving stochastic optimal control problems governed by second-order Hamilton-Jacobi-Bellman (HJB) equations. At each iteration, a neural network is trained to approximate the value function by minimizing the residual of a linear PDE induced by a fixed policy. This linear structure enables systematic $L^{2}$ error control at each policy evaluation step, and allows us to derive explicit Lipschitz-type bounds that quantify how value gradient errors propagate to the policy updates. This interpretability provides a theoretical basis for evaluating policy quality during training. Our method extends recent deterministic PINN-based approaches to stochastic settings, inheriting the global exponential convergence guarantees of classical policy iteration under mild conditions. We demonstrate the effectiveness of…
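
The loop the abstract describes (fix a policy, solve the resulting linear PDE for its value function, then update the policy from the value gradient) can be illustrated on a scalar stochastic LQR problem. The sketch below is not the authors' PINN-PI implementation. It assumes a discounted infinite-horizon cost with dynamics dX = (aX + bu) dt + sigma dW and running cost q x^2 + r u^2, and it replaces the neural network with the exact quadratic ansatz V(x) = p x^2 + c, so the policy evaluation step has a closed form. All function names and parameter values are hypothetical.

```python
import math


def evaluate_policy(k, a, b, q, r, rho, sigma):
    """Policy evaluation for the fixed linear policy u = -k x.

    Assumed linear PDE (discounted setting, an assumption of this sketch):
        rho V = (q + r k^2) x^2 + (a - b k) x V' + (sigma^2 / 2) V''
    Substituting V(x) = p x^2 + c and matching coefficients gives p and c.
    """
    decay = rho - 2.0 * (a - b * k)
    if decay <= 0.0:
        raise ValueError("policy has infinite discounted cost; pick a larger gain")
    p = (q + r * k * k) / decay      # coefficient of x^2
    c = sigma * sigma * p / rho      # constant term created by the noise
    return p, c


def improve_policy(p, b, r):
    """Policy improvement: minimize r u^2 + b u V'(x) over u, with V'(x) = 2 p x."""
    return b * p / r                 # new gain k, so that u = -k x


def policy_iteration(k0, a, b, q, r, rho, sigma, iters=20):
    """Alternate evaluation and improvement, starting from the gain k0."""
    k = k0
    history = []
    for _ in range(iters):
        p, c = evaluate_policy(k, a, b, q, r, rho, sigma)
        history.append(p)
        k = improve_policy(p, b, r)
    return p, c, k, history


def riccati_solution(a, b, q, r, rho):
    """Positive root of (b^2 / r) p^2 + (rho - 2 a) p - q = 0, the fixed point."""
    s = b * b / r
    m = rho - 2.0 * a
    return (-m + math.sqrt(m * m + 4.0 * s * q)) / (2.0 * s)


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    p, c, k, history = policy_iteration(2.0, 1.0, 1.0, 1.0, 1.0, 0.1, 0.5)
    print("value coefficients per iteration:", history[:4])
    print("final p:", p, "Riccati root:", riccati_solution(1.0, 1.0, 1.0, 1.0, 0.1))
```

In this toy setting the improved gain is k = b p / r, so an error in the value coefficient p shifts the gain by exactly |b| / r times that error. This is a one-dimensional analogue of the Lipschitz-type propagation bound mentioned in the abstract, not the bound proved in the paper.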

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- A neural network-based policy iteration is proposed for solving stochastic optimal control problems. This is more practical than previous methods because it does not need to solve a PDE explicitly.
- The authors characterize the function approximation error and the global convergence of the policy iterates. This is a strong theoretical guarantee.
- Experiments demonstrate the outstanding performance of the proposed method compared to a standard SAC method.

Weaknesses

- Introducing neural networks into policy iteration is not a new idea. It would be helpful if the authors could clarify the key challenges of applying it to stochastic optimal control problems.
- The analysis of approximation error and global convergence is similar to that in reinforcement learning. It is important to clarify the new analytical challenges.
- The neural network-based policy iteration requires accurate model information.
- The provided experiments are limited to textbook examples.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper is clearly presented: readers can easily follow the motivation, and the algorithm is easy to understand. The proposed method is also original, at least to my knowledge. However, I doubt the significance of the method, as detailed in the Weaknesses section below.

Weaknesses

I believe the paper has a significant weakness in its empirical evaluations. First, the authors only perform experiments in at most 20 dimensions, which is not generally considered high in the domain of using deep learning to solve stochastic optimal control (SOC) problems. The authors can refer to the following papers for more challenging experimental settings and baselines: [1] Hua, Mengjian, Mathieu Laurière, and Eric Vanden-Eijnden. "An Efficient On-Policy Deep Learning Framework for St

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1) The paper is well written and well organized, and it provides a clear and sound convergence analysis for policy iteration within a physics-informed learning framework. The $L^2$ stability results, Lipschitz continuity of the policy-improvement map, and exponential convergence proofs seem definitely nontrivial.
2) In contrast to most neural control methods that treat the PDE structure implicitly, this approach explicitly leverages elliptic PDE properties and energy estimates, yielding interp

Weaknesses

1) Although PINNs offer mesh-free flexibility, they are still computationally intensive, especially in high-dimensional control spaces. The paper would benefit from reporting computational cost, training time, and scalability comparisons against operator-learning or Galerkin-based solvers.
2) While the theory accounts for residual-based training error, the practical behavior of the neural approximator under finite sampling and stochastic training noise remains untested.
3) The approach assumes

Taxonomy

Topics: Model Reduction and Neural Networks · Adaptive Dynamic Programming Control · Reinforcement Learning in Robotics