Match or Replay: Self Imitating Proximal Policy Optimization
Gaurav Chaudhary, Laxmidhar Behera, Washim Uddin Mondal

TL;DR
This paper introduces a self-imitating on-policy RL algorithm that improves exploration and sample efficiency by leveraging past successful experiences, demonstrating faster learning and higher success rates in various environments.
Contribution
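The paper proposes a self-imitating on-policy RL method that improves exploration and sample efficiency in two ways: in dense-reward settings it uses an optimal transport distance to favor state visitation distributions that match the most rewarding trajectory, and in sparse-reward settings it uniformly replays successful trajectories the agent has encountered itself.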
Findings
Significant improvements in learning speed across tested environments.
Higher success rates compared to existing self-imitating RL methods.
Effective in both dense and sparse reward scenarios.
Abstract
Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards. Traditional exploration strategies can lead to slow learning and suboptimal performance because agents fail to systematically build on previously successful experiences, thereby reducing sample efficiency. To tackle this issue, we propose a self-imitating on-policy algorithm that enhances exploration and sample efficiency by leveraging past high-reward state-action pairs to guide policy updates. Our method incorporates self-imitation by using optimal transport distance in dense reward environments to prioritize state visitation distributions that match the most rewarding trajectory. In sparse-reward environments, we uniformly replay successful self-encountered trajectories to facilitate structured exploration. Experimental results across diverse environments…
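The two self-imitation mechanisms described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `SuccessBuffer` class, the `(state, action, reward)` transition layout, the oldest-first eviction rule, and the `wasserstein_1d` helper are all hypothetical names and choices made for illustration. The distance function covers only the scalar-state, equal-sample-size case, where the optimal transport plan reduces to matching sorted values; multi-dimensional states would need a general optimal transport solver.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical transition layout: (state, action, reward).
Transition = Tuple[Sequence[float], int, float]
Trajectory = List[Transition]


class SuccessBuffer:
    """Stores successful trajectories and replays them uniformly at random."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.trajectories: List[Trajectory] = []
        self._rng = random.Random(seed)

    def add(self, trajectory: Trajectory, success: bool) -> None:
        # Only trajectories flagged as successful are kept.
        if not success:
            return
        if len(self.trajectories) >= self.capacity:
            # Assumption: evict the oldest trajectory when the buffer is full.
            self.trajectories.pop(0)
        self.trajectories.append(trajectory)

    def sample(self, k: int) -> List[Trajectory]:
        # Uniform sampling with replacement over the stored trajectories.
        if not self.trajectories:
            return []
        return [self._rng.choice(self.trajectories) for _ in range(k)]


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """1-Wasserstein distance between two equal-size samples of scalar states.

    For equal-size empirical distributions on the real line, the optimal
    transport plan pairs the i-th smallest value of xs with the i-th
    smallest value of ys.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)
```

In a sparse-reward setting, trajectories drawn from `sample` could be mixed into the policy update as extra imitation data. In a dense-reward setting, a distance such as `wasserstein_1d` between the states of a new rollout and those of the best trajectory so far could serve as a matching signal, with smaller values indicating closer agreement.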
