A Stochastic Trust-Region Framework for Policy Optimization

Mingming Zhao; Yongfeng Li; Zaiwen Wen

arXiv:1911.11640 · math.OC · November 27, 2019 · 1 citation

Open Access

TL;DR

This paper introduces a stochastic trust-region framework for policy optimization in deep reinforcement learning, addressing theoretical and numerical challenges to improve policy performance and robustness.

Contribution

It proposes a novel stochastic trust-region method with a line search and bias correction, ensuring monotonic reward improvement and convergence in policy optimization.

Findings

01 Effective in robotic control tasks

02 Robust performance in game simulations

03 Guarantees monotonic reward increase

Abstract

In this paper, we study several challenging theoretical and numerical issues in the well-known trust region policy optimization method for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to that policy. The trust region subproblem is constructed from a surrogate function consistent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line search scheme, which ensures that each step improves the model function and stays within the trust region. To overcome the bias that sampling introduces into the function estimates in the stochastic setting, we add the empirical standard deviation of the total expected reward to the predicted increase in the ratio used to update the trust region radius and decide whether the trial…
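The ratio test described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the thresholds `eta_low` and `eta_high`, the `shrink` and `grow` factors, and the choice to estimate the standard deviation from returns sampled under the current policy are all assumptions made for illustration. The only element taken from the abstract is that the empirical standard deviation is added to the predicted increase in the ratio that drives the radius update.

```python
from statistics import mean, stdev


def corrected_ratio(actual_increase, predicted_increase, reward_std):
    """Observed increase divided by (predicted increase + empirical std).

    Adding reward_std to the denominator is the correction named in the
    abstract; the exact form used in the paper may differ (assumption).
    """
    denominator = predicted_increase + reward_std
    if denominator <= 0:
        raise ValueError("predicted increase plus std must be positive")
    return actual_increase / denominator


def update_radius(radius, ratio, eta_low=0.25, eta_high=0.75,
                  shrink=0.5, grow=2.0):
    """Classical trust-region radius rule with hypothetical constants."""
    if ratio < eta_low:
        return radius * shrink   # model was too optimistic: shrink
    if ratio > eta_high:
        return radius * grow     # model was reliable: expand
    return radius                # otherwise keep the radius


def ratio_test(old_returns, new_returns, predicted_increase, radius):
    """Estimate the ratio from sampled returns and update the radius."""
    # Sample estimate of the change in total expected reward.
    actual_increase = mean(new_returns) - mean(old_returns)
    # Empirical std of returns under the current policy (assumption).
    reward_std = stdev(old_returns)
    ratio = corrected_ratio(actual_increase, predicted_increase, reward_std)
    return ratio, update_radius(radius, ratio)


# Example with made-up returns from the current and the trial policy.
ratio, new_radius = ratio_test(
    old_returns=[1.0, 2.0, 3.0],
    new_returns=[3.0, 4.0, 5.0],
    predicted_increase=1.0,
    radius=0.1,
)
```

Because the standard deviation enlarges the denominator, a noisy reward estimate lowers the ratio, so under these assumed thresholds the radius grows only when the observed improvement clearly exceeds the sampling noise.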

Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Age of Information Optimization