Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime
Andrea Agazzi, Jianfeng Lu

TL;DR
This paper analyzes the training dynamics of wide single hidden layer neural networks with softmax policy gradient methods in reinforcement learning, proving global optimality of fixed points in the mean-field regime.
Contribution
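The paper shows that the training dynamics of softmax policy gradient in the mean-field regime form a Wasserstein gradient flow of distributions in parameter space, and proves global optimality of the fixed points of this flow under mild conditions on the initialization.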
Findings
Wasserstein gradient flow describes policy training dynamics in the mean-field regime.
Global optimality of the fixed points of these dynamics is proven under mild conditions on the initialization.
Analysis applies to wide single hidden layer neural networks with entropy regularization.
Abstract
We study the problem of policy optimization for infinite-horizon discounted Markov Decision Processes with softmax policy and nonlinear function approximation trained with policy gradient algorithms. We concentrate on the training dynamics in the mean-field regime, modeling e.g., the behavior of wide single hidden layer neural networks, when exploration is encouraged through entropy regularization. The dynamics of these models is established as a Wasserstein gradient flow of distributions in parameter space. We further prove global optimality of the fixed points of this dynamics under mild conditions on their initialization.
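As a rough illustration of the setting described in the abstract, the following sketch builds a softmax policy on top of a wide single hidden layer network with mean-field (1/N) output scaling. It is not taken from the paper: the tanh activation, the one-hot action encoding, the Gaussian initialization, and all function and variable names are assumptions chosen for illustration only.

```python
import numpy as np


def mean_field_logits(state, actions, weights, outputs):
    """Assumed form: f(s, a) = (1/N) * sum_i c_i * tanh(w_i . [s, a])."""
    # One input row per action: the state concatenated with that action's features.
    inputs = np.stack([np.concatenate([state, a]) for a in actions])  # shape (A, d)
    hidden = np.tanh(inputs @ weights.T)                              # shape (A, N)
    # Mean-field scaling: average over the N hidden units instead of summing.
    return hidden @ outputs / outputs.shape[0]                        # shape (A,)


def softmax_policy(logits):
    """pi(a | s) proportional to exp(f(s, a)), computed stably."""
    shifted = logits - logits.max()
    probs = np.exp(shifted)
    return probs / probs.sum()


def entropy(probs):
    """Shannon entropy of the action distribution, the quantity an entropy regularizer rewards."""
    return -np.sum(probs * np.log(probs))


# Hypothetical sizes and random parameters, purely for demonstration.
rng = np.random.default_rng(0)
n_neurons, state_dim, n_actions = 1000, 3, 4
actions = np.eye(n_actions)                                   # one-hot action features
weights = rng.normal(size=(n_neurons, state_dim + n_actions))  # hidden-layer parameters w_i
outputs = rng.normal(size=n_neurons)                           # output weights c_i
state = rng.normal(size=state_dim)

probs = softmax_policy(mean_field_logits(state, actions, weights, outputs))
policy_entropy = entropy(probs)
```

In this picture each pair (c_i, w_i) is one particle, and the network output depends on the parameters only through their empirical distribution. As the number of hidden units grows, that empirical distribution is the object whose evolution the abstract describes as a Wasserstein gradient flow.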
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics · Neural Networks and Applications
Methods: Softmax
