Clipped-Objective Policy Gradients for Pessimistic Policy Optimization
Jared Markowitz, Edward W. Staley

TL;DR
This paper introduces a simple clipped-objective policy gradient (COPG) method that enhances exploration and improves learning performance in deep reinforcement learning, especially in continuous action spaces, by promoting a pessimistic policy update.
Contribution
The paper proposes a clipped-objective policy gradient (COPG) whose objective is more pessimistic than PPO's and thereby promotes exploration, yielding better performance than PPO and results comparable to or better than TRPO.
Findings
COPG improves learning performance over PPO in various settings.
The pessimistic objective promotes enhanced exploration.
COPG achieves comparable or superior results to TRPO.
Abstract
To facilitate efficient learning, policy gradient approaches to deep reinforcement learning (RL) are typically paired with variance reduction measures and strategies for making large but safe policy changes based on a batch of experiences. Natural policy gradient methods, including Trust Region Policy Optimization (TRPO), seek to produce monotonic improvement through bounded changes in policy outputs. Proximal Policy Optimization (PPO) is a commonly used, first-order algorithm that instead uses loss clipping to take multiple safe optimization steps per batch of data, replacing the bound on the single step of TRPO with regularization on multiple steps. In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective. Instead of the importance sampling objective of PPO, we recommend a…
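To make the contrast in the abstract concrete, the sketch below computes two per-sample objective terms. The first is PPO's standard clipped importance-sampling surrogate. The second is only an assumed reading of a "clipped-objective policy gradient": the basic policy-gradient term (log-probability times advantage) with the log-probability clipped to the range matching PPO's ratio bounds. The abstract here is truncated, so that second form, along with the function names and the `eps` default, is illustrative and not confirmed by the text above.

```python
import math


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clip_term(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized).

    Uses the importance ratio pi_new / pi_old, clipped to [1 - eps, 1 + eps].
    """
    ratio = math.exp(logp_new - logp_old)
    return min(ratio * adv, clip(ratio, 1.0 - eps, 1.0 + eps) * adv)


def clipped_logprob_term(logp_new, logp_old, adv, eps=0.2):
    """ASSUMED clipped policy-gradient term for one sample (to be maximized).

    Replaces the ratio with log pi_new and clips it to the log-space
    equivalent of PPO's bounds: [log(pi_old * (1 - eps)), log(pi_old * (1 + eps))].
    This form is an illustration, not taken from the truncated abstract.
    """
    lo = logp_old + math.log(1.0 - eps)
    hi = logp_old + math.log(1.0 + eps)
    return min(logp_new * adv, clip(logp_new, lo, hi) * adv)


if __name__ == "__main__":
    logp_old = math.log(0.5)
    # Probability of the sampled action rises from 0.5 to 0.75, positive advantage.
    print(ppo_clip_term(math.log(0.75), logp_old, adv=1.0))
    print(clipped_logprob_term(math.log(0.75), logp_old, adv=1.0))
```

In both functions the `min` keeps the smaller (more conservative) of the unclipped and clipped terms, so once the new policy moves outside the clipping range in the direction favored by the advantage, the term becomes constant and contributes no gradient for that sample.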
Taxonomy
Topics: Machine Learning and ELM · Advanced Memory and Neural Computing · Fuel Cells and Related Materials
Methods: Entropy Regularization · Proximal Policy Optimization · Trust Region Policy Optimization
