Optimistic Policy Regularization
Mai Pham, Vikrant Vaze, Peter Chin

TL;DR
This paper introduces Optimistic Policy Regularization (OPR), a method that counters premature convergence in deep reinforcement learning by preserving and reinforcing historically successful trajectories, improving sample efficiency and performance on Atari games and in cyber-defense environments.
Contribution
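The paper proposes OPR, a lightweight regularization technique that keeps a dynamic buffer of high-performing episodes and biases policy training toward them through directional log-ratio reward shaping and an auxiliary behavioral cloning objective. Instantiated on Proximal Policy Optimization (PPO), it substantially improves sample efficiency.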
Findings
OPR, instantiated on PPO, improves sample efficiency on the Arcade Learning Environment.
OPR achieves the highest score in 22 of 49 Atari games at the 10-million-step benchmark.
OPR outperforms baseline methods in cyber-defense tasks.
Abstract
Deep reinforcement learning agents frequently suffer from premature convergence, where early entropy collapse causes the policy to discard exploratory behaviors before discovering globally optimal strategies. We introduce Optimistic Policy Regularization (OPR), a lightweight mechanism designed to preserve and reinforce historically successful trajectories during policy optimization. OPR maintains a dynamic buffer of high-performing episodes and biases learning toward these behaviors through directional log-ratio reward shaping and an auxiliary behavioral cloning objective. When instantiated on Proximal Policy Optimization (PPO), OPR substantially improves sample efficiency on the Arcade Learning Environment. Across 49 Atari games evaluated at the 10-million step benchmark, OPR achieves the highest score in 22 environments despite baseline methods being reported at the standard…
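The two ingredients the abstract names most concretely, a buffer of high-performing episodes and an auxiliary behavioral cloning objective, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the top-k retention rule, the tabular softmax policy, and the names `TopKEpisodeBuffer`, `bc_loss`, and `policy_logits` are all hypothetical. The directional log-ratio reward shaping is left out because the abstract does not give its form.

```python
import heapq
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

# Hypothetical discrete encoding of one transition: (state_id, action).
Step = Tuple[int, int]


@dataclass(order=True)
class Episode:
    ret: float                                  # episode return, used for ordering
    steps: List[Step] = field(compare=False)    # transitions, ignored when comparing


class TopKEpisodeBuffer:
    """Keeps the k highest-return episodes seen so far (assumed retention rule)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._heap: List[Episode] = []          # min-heap keyed on return

    def add(self, ret: float, steps: Sequence[Step]) -> None:
        episode = Episode(ret, list(steps))
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, episode)
        elif ret > self._heap[0].ret:
            # Evict the weakest stored episode in favor of the better one.
            heapq.heapreplace(self._heap, episode)

    def returns(self) -> List[float]:
        return sorted((ep.ret for ep in self._heap), reverse=True)

    def steps(self) -> List[Step]:
        return [step for ep in self._heap for step in ep.steps]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)                             # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bc_loss(policy_logits: Dict[int, Sequence[float]], steps: Sequence[Step]) -> float:
    """Mean negative log-likelihood of the buffered actions under the policy."""
    if not steps:
        return 0.0
    nll = 0.0
    for state, action in steps:
        nll -= math.log(softmax(policy_logits[state])[action])
    return nll / len(steps)


buffer = TopKEpisodeBuffer(capacity=2)
buffer.add(1.0, [(0, 0)])
buffer.add(5.0, [(0, 1)])
buffer.add(3.0, [(1, 1)])
buffer.add(0.5, [(1, 0)])                       # too weak, never stored

uniform_policy = {0: [0.0, 0.0], 1: [0.0, 0.0]}
imitating_policy = {0: [0.0, 5.0], 1: [0.0, 5.0]}
loss_uniform = bc_loss(uniform_policy, buffer.steps())
loss_imitating = bc_loss(imitating_policy, buffer.steps())
```

In a training loop, a term like `bc_loss` would presumably be added to the PPO objective with a weighting coefficient, so that the policy is pulled toward buffered behavior without replacing the on-policy update. The coefficient and its schedule are not given in the abstract.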
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Adversarial Robustness in Machine Learning
