Policy Optimization for Continuous Reinforcement Learning
Hanyang Zhao, Wenpin Tang, David D. Yao

TL;DR
This paper develops a continuous-time reinforcement learning framework using occupation measures, deriving new performance formulas and extending popular policy optimization methods to continuous settings, with demonstrated numerical benefits.
Contribution
It introduces occupation time concepts for continuous RL, deriving performance formulas and adapting policy gradient and TRPO/PPO methods to continuous dynamics.
Findings
Performance-difference and local-approximation formulas derived for continuous RL
Policy gradient and TRPO/PPO extended to the continuous setting
Numerical experiments demonstrate the effectiveness and advantages of the approach
Abstract
We study reinforcement learning (RL) in the setting of continuous time and space, for an infinite horizon with a discounted objective and the underlying dynamics driven by a stochastic differential equation. Built upon recent advances in the continuous approach to RL, we develop a notion of occupation time (specifically for a discounted objective), and show how it can be effectively used to derive performance-difference and local-approximation formulas. We further extend these results to illustrate their applications in the PG (policy gradient) and TRPO/PPO (trust region policy optimization/proximal policy optimization) methods, which have been familiar and powerful tools in the discrete RL setting but under-developed in continuous RL. Through numerical experiments, we demonstrate the effectiveness and advantages of our approach.
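To make the setting in the abstract concrete, the following is a minimal sketch of a Monte Carlo estimate of a discounted objective for a controlled stochastic differential equation, using Euler-Maruyama time stepping. It is not the authors' implementation. The function name `discounted_return`, the one-dimensional dynamics, the linear policy, the quadratic reward, and the truncation of the infinite horizon at a finite `horizon` are all assumptions made for illustration.

```python
import numpy as np


def discounted_return(policy, drift, diffusion, reward, x0,
                      beta=1.0, dt=0.01, horizon=10.0, n_paths=2000, seed=0):
    """Estimate E[ integral_0^T exp(-beta t) r(X_t, a_t) dt ] by simulation.

    Assumed (hypothetical) model: dX_t = drift(X_t, a_t) dt + diffusion(X_t, a_t) dW_t,
    with a_t = policy(X_t). The infinite horizon is truncated at T = horizon,
    which is reasonable when exp(-beta * T) is negligible.
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(horizon / dt))
    x = np.full(n_paths, float(x0))   # one state per simulated path
    total = np.zeros(n_paths)         # running discounted reward per path
    for k in range(n_steps):
        a = policy(x)
        # Left Riemann sum of the discounted reward integral.
        total += np.exp(-beta * k * dt) * reward(x, a) * dt
        # Euler-Maruyama step: Brownian increments have variance dt.
        dw = rng.normal(size=n_paths) * np.sqrt(dt)
        x = x + drift(x, a) * dt + diffusion(x, a) * dw
    return float(total.mean())


# Illustrative usage with made-up dynamics: the action is the drift,
# the noise level is constant, and the reward penalizes state and effort.
value = discounted_return(
    policy=lambda x: -0.5 * x,
    drift=lambda x, a: a,
    diffusion=lambda x, a: 0.3 * np.ones_like(x),
    reward=lambda x, a: -(x ** 2 + 0.1 * a ** 2),
    x0=1.0,
)
```

If the reward is replaced by the indicator of a set of states, the same estimator returns the expected discounted time the process spends in that set. This is one plausible reading of a discounted occupation time; the paper's exact definition may differ.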
Peer Reviews
Decision: NeurIPS 2023 poster
The paper provides a thorough theoretical treatment of defining policy gradients for the continuous time RL setting. The paper clearly defines and adequately answers its stated objectives.
The biggest area for improvement in this paper is in the empirical results. While the main results of this paper are theoretical, new practical algorithms are presented. Thus, they deserve proper evaluation and experimentation to educate the reader on the challenges of using them. For example, there are no experiments illustrating that there were any special difficulties in applying these algorithms to the continuous setting. There should be experiments illustrating how the hyperparameters, pa
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms
