A Stochastic Trust-Region Framework for Policy Optimization
Mingming Zhao, Yongfeng Li, Zaiwen Wen

TL;DR
This paper introduces a stochastic trust-region framework for policy optimization in deep reinforcement learning, addressing theoretical and numerical challenges to improve policy performance and robustness.
Contribution
It proposes a novel stochastic trust-region method with a line search and bias correction, ensuring monotonic reward improvement and convergence in policy optimization.
Findings
Effective in robotic control tasks
Robust performance in game simulations
Guarantees monotonic reward increase
Abstract
In this paper, we study a few challenging theoretical and numerical issues in the well-known trust region policy optimization for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to the policy. The trust region subproblem is constructed with a surrogate function consistent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line search scheme to ensure that each step improves the model function and stays in the trust region. To overcome the bias that sampling introduces into the function estimates in the stochastic setting, we add the empirical standard deviation of the total expected reward to the predicted increase in a ratio in order to update the trust region radius and decide whether the trial…
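The ratio test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `trust_region_step`, the thresholds, the shrink/expand factors, and the choice to take the standard deviation over sampled returns of the trial policy are all assumptions made for illustration. Only the idea of adding an empirical standard deviation to the predicted increase in the ratio's denominator comes from the abstract.

```python
import statistics


def trust_region_step(reward_old, trial_rewards, predicted_increase, radius,
                      accept_threshold=0.1, expand_threshold=0.75,
                      shrink=0.5, expand=2.0):
    """Decide whether to accept a trial policy and update the radius.

    All thresholds and factors are hypothetical defaults, not values
    taken from the paper.
    """
    # Sample estimate of the trial policy's total expected reward.
    reward_new = statistics.fmean(trial_rewards)
    # Empirical standard deviation of the sampled returns (assumed choice).
    sigma = statistics.stdev(trial_rewards) if len(trial_rewards) > 1 else 0.0

    # Predicted increase corrected by the standard deviation.
    denom = predicted_increase + sigma
    if denom <= 0.0:
        # The model predicts no improvement: reject and shrink.
        return False, radius * shrink, float("-inf")

    # Observed increase relative to the corrected prediction.
    ratio = (reward_new - reward_old) / denom

    accepted = ratio >= accept_threshold
    if not accepted:
        new_radius = radius * shrink      # poor agreement: shrink the region
    elif ratio >= expand_threshold:
        new_radius = radius * expand      # very good agreement: expand
    else:
        new_radius = radius               # acceptable agreement: keep
    return accepted, new_radius, ratio


if __name__ == "__main__":
    print(trust_region_step(10.0, [10.0, 12.0, 14.0], 2.0, 1.0))
```

Because the standard deviation inflates the denominator, a noisy estimate needs a larger observed gain to count as very good agreement, which makes the radius update more conservative when sampling noise is high.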
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Age of Information Optimization
