Policy Optimization for Continuous Reinforcement Learning

Hanyang Zhao; Wenpin Tang; David D. Yao

arXiv:2305.18901 · cs.LG · October 19, 2023 · 6 cites

TL;DR

This paper develops a continuous-time reinforcement learning framework using occupation measures, deriving new performance formulas and extending popular policy optimization methods to continuous settings, with demonstrated numerical benefits.

Contribution

It introduces occupation time concepts for continuous RL, deriving performance formulas and adapting policy gradient and TRPO/PPO methods to continuous dynamics.

Findings

01

Effective continuous RL performance formulas derived

02

Extension of policy gradient and TRPO/PPO to continuous domain

03

Numerical experiments show improved performance and advantages

Abstract

We study reinforcement learning (RL) in the setting of continuous time and space, for an infinite horizon with a discounted objective and the underlying dynamics driven by a stochastic differential equation. Built upon recent advances in the continuous approach to RL, we develop a notion of occupation time (specifically for a discounted objective), and show how it can be effectively used to derive performance-difference and local-approximation formulas. We further extend these results to illustrate their applications in the PG (policy gradient) and TRPO/PPO (trust region policy optimization/proximal policy optimization) methods, which have been familiar and powerful tools in the discrete RL setting but under-developed in continuous RL. Through numerical experiments, we demonstrate the effectiveness and advantages of our approach.
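
To make the setting in the abstract concrete, here is a minimal sketch of a discounted objective under SDE dynamics. It is not the paper's method or code: the linear dynamics, the quadratic reward, the linear feedback policy u = -k x, and every name and default value below are assumptions chosen for illustration. The state is simulated with an Euler-Maruyama scheme, and the infinite-horizon discounted reward is approximated by a truncated Riemann sum averaged over sample paths.

```python
import math
import random


def discounted_return(k, x0=1.0, a=-0.5, b=1.0, sigma=0.2, beta=1.0,
                      dt=0.01, horizon=20.0, n_paths=64, seed=0):
    """Monte Carlo estimate of E[ int_0^T exp(-beta t) r(X_t, u_t) dt ].

    Assumed (hypothetical) model: dX = (a X + b u) dt + sigma dW,
    policy u = -k X, reward r(x, u) = -(x^2 + u^2).
    The horizon T truncates the infinite-horizon integral.
    """
    rng = random.Random(seed)
    n_steps = int(round(horizon / dt))
    total = 0.0
    for _ in range(n_paths):
        x, value = x0, 0.0
        for i in range(n_steps):
            u = -k * x                            # linear feedback policy
            reward = -(x * x + u * u)             # quadratic running reward
            value += math.exp(-beta * i * dt) * reward * dt
            dw = rng.gauss(0.0, math.sqrt(dt))    # Brownian increment
            x += (a * x + b * u) * dt + sigma * dw  # Euler-Maruyama step
        total += value
    return total / n_paths


if __name__ == "__main__":
    # Compare two feedback gains under the same random seed.
    for gain in (0.0, 0.5):
        print(gain, discounted_return(gain))
```

With sigma set to 0 the model is deterministic, and the estimate can be checked against the closed form -(1 + k^2) x0^2 / (beta - 2(a - b k)), which holds when beta > 2(a - b k); the time discretization introduces a small bias that shrinks with dt.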

Peer Reviews

Decision · NeurIPS 2023 poster

Reviewer 01 · Rating 7 · Accept: Technically solid paper, with high impact on at least one sub-area, or moderate-to-high impact on more than one areas, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations. · Confidence 3

Strengths

The paper provides a thorough theoretical treatment of defining policy gradients for the continuous time RL setting. The paper clearly defines and adequately answers its stated objectives.

Weaknesses

The biggest area for improvement in this paper is in the empirical results. While the main results of this paper are theoretical, new practical algorithms are presented. Thus, they deserve proper evaluation and experimentation to educate the reader on the challenges of using them. For example, there are no experiments illustrating whether there were any special difficulties in applying these algorithms to the continuous setting. There should be experiments illustrating how the hyperparameters, pa

Taxonomy

Topics: Advanced Multi-Objective Optimization Algorithms