Optimistic Proximal Policy Optimization
Takahisa Imagawa, Takuya Hiraoka, Yoshimasa Tsuruoka

TL;DR
This paper introduces OPPO, an extension of proximal policy optimization that applies optimism under uncertainty to improve reinforcement learning in environments with sparse rewards, and reports that it outperforms existing methods in a tabular task.
Contribution
The paper proposes OPPO, a novel reinforcement learning algorithm that incorporates optimism in the face of uncertainty to better handle sparse reward scenarios.
Findings
OPPO outperforms existing methods in a tabular task.
OPPO evaluates the policy optimistically, based on the uncertainty of the estimated total return.
OPPO alleviates the difficulty of learning a good policy when rewards are rare.
Abstract
Reinforcement learning, a machine learning framework for training an autonomous agent based on rewards, has shown outstanding results in various domains. However, learning a good policy is known to be difficult in domains where rewards are rare. We propose a method, optimistic proximal policy optimization (OPPO), to alleviate this difficulty. OPPO considers the uncertainty of the estimated total return and optimistically evaluates the policy based on that amount. We show that OPPO outperforms the existing methods in a tabular task.
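The abstract does not give OPPO's exact formulation, so the following is only a minimal sketch of the general idea of optimistic evaluation: adding an uncertainty bonus to an estimated return. The function names, the `beta` weight, and the count-based uncertainty proxy are all assumptions made for illustration and are not taken from the paper.

```python
import math


def count_based_uncertainty(visit_count):
    """Hypothetical uncertainty proxy: shrinks as a state is visited more often."""
    return 1.0 / math.sqrt(visit_count + 1)


def optimistic_return(mean_return, uncertainty, beta=1.0):
    """Optimistic estimate: mean return plus beta times the uncertainty (assumed form)."""
    return mean_return + beta * uncertainty


# Tabular toy example: two states with the same estimated return (no reward
# seen yet, as in a sparse-reward task) but very different visit counts.
mean_returns = {"s0": 0.0, "s1": 0.0}
visit_counts = {"s0": 99, "s1": 0}

optimistic = {
    state: optimistic_return(
        mean_returns[state],
        count_based_uncertainty(visit_counts[state]),
        beta=0.5,
    )
    for state in mean_returns
}

# The rarely visited state receives the larger optimistic estimate.
print(optimistic)
```

Under these assumptions, a policy evaluated with the optimistic estimate is nudged toward the rarely visited state even though no reward has been observed, which is the intuition behind optimism in sparse-reward settings.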
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
