On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling
Nicholas E. Corrado, Josiah P. Hanna

TL;DR
This paper introduces PROPS, an off-policy sampling method that reduces sampling error in on-policy policy gradient RL, leading to improved data efficiency and more reliable policy updates.
Contribution
The paper proposes PROPS, an adaptive off-policy sampling technique that reduces sampling error in on-policy RL, enhancing data efficiency and policy learning stability.
Findings
PROPS decreases sampling error throughout training.
PROPS increases data efficiency of policy gradient algorithms.
Empirical results on MuJoCo and discrete tasks show improved performance.
Abstract
On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent's current policy. However, after observing only a finite number of trajectories, such on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to high-variance gradient estimates that yield data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error w.r.t. the expected on-policy distribution than on-policy sampling can produce (Zhong et al., 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training. Our method, Proximal Robust On-Policy Sampling…
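The notion of sampling error in the abstract can be illustrated with a toy example. The sketch below is not PROPS and not the method of Zhong et al.; it is a hypothetical single-state illustration with a fixed three-action policy `pi`. It compares i.i.d. on-policy sampling with a simple non-i.i.d. rule that always picks the currently most under-sampled action, and measures sampling error as the total variation distance between the empirical action distribution and `pi`. All function names, the policy, and the sample sizes are assumptions chosen for illustration.

```python
import numpy as np

def empirical_dist(actions, n_actions):
    """Empirical action frequencies from a batch of sampled actions."""
    counts = np.bincount(actions, minlength=n_actions)
    return counts / counts.sum()

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def sample_on_policy(pi, n, rng):
    """Standard i.i.d. on-policy sampling: draw n actions from pi."""
    return rng.choice(len(pi), size=n, p=pi)

def sample_undersampled_first(pi, n):
    """Toy non-i.i.d. sampler (an assumption, not PROPS): at each step,
    take the action whose observed count lags furthest behind the count
    expected under pi."""
    counts = np.zeros(len(pi))
    actions = []
    for t in range(1, n + 1):
        deficit = pi * t - counts  # expected count minus observed count
        a = int(np.argmax(deficit))
        counts[a] += 1
        actions.append(a)
    return np.array(actions)

pi = np.array([0.5, 0.3, 0.2])  # hypothetical current policy
n = 20                          # small batch, where sampling error matters
rng = np.random.default_rng(0)

# Average sampling error of i.i.d. on-policy sampling over many batches.
iid_errors = [
    total_variation(empirical_dist(sample_on_policy(pi, n, rng), len(pi)), pi)
    for _ in range(1000)
]
mean_iid_error = float(np.mean(iid_errors))

# Sampling error of the adaptive, non-i.i.d. sampler on a single batch.
adaptive_error = float(
    total_variation(empirical_dist(sample_undersampled_first(pi, n), len(pi)), pi)
)

print(f"mean i.i.d. sampling error: {mean_iid_error:.3f}")
print(f"adaptive sampling error:    {adaptive_error:.3f}")
```

In this toy setting the adaptive sampler tracks `pi` more closely than i.i.d. sampling does at the same batch size, which is the intuition the abstract appeals to. The actual method operates on full policies over states and actions and is described in the paper itself.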
Peer Reviews
Decision·Submitted to ICLR 2025
Strengths: 1. Estimating the on-policy gradient requires on-policy samples. Because we do not have access to the true gradient, using the target policy to draw on-policy samples to estimate the true gradient has an estimation error. This error is caused by under-sampled or over-sampled data. Encouraging under-sampled on-policy samples reduces the sampling error and improves gradient estimation. This paper is the first to apply this idea to the PPO algorithm. 2. This paper proposes an algorithm t
Weaknesses: 1. This paper adjusts the batch size of PPO to "1024, 2048, 4096, 8192", which is much larger than the original batch size (64 samples) of PPO. This paper makes this change because they need to learn a behavior policy to encourage under-sampled data at each update step. In other words, they run a mini-PPO to encourage under-sampled data by finding a behavior policy. This drastic increase in batch size is not well-justified. In fact, it is confusing that their PPO algorithm still achieves
1. The sampling error problem the authors tackle is interesting; there has not been sufficient discussion on this topic in the RL community. 2. The presentation and writing are generally clear. 3. The sampling error problem can arise in many real-world setups. 4. The proposed solution is simple in a good way.
1. The experiments and motivation the authors provided are not well matched to the problem. When is this really a problem in the real world? In most standard examples there is ample time to correct the sampling errors with more data and simulation (this happens with not a lot of samples in practice). Perhaps in cases of non-stationarity of the system itself this "quicker" adaptation becomes crucial? Or maybe for very large state/action spaces this is more of a problem. I propose the authors
Originality: The idea proposed is novel. Although the method proposed until page 5 is from Zhong et al.'s paper, and the application to policy gradient estimation seems somewhat trivial, I think it is still valuable and necessary. Quality: The quality of the paper is really good. The authors explain the problem and the main idea very well. The method is sound and directly addresses the problem presented. The experiments are well-designed, explained, and commented on. The resul
Contribution: The paper's contribution starts on page 5; until that point, the authors focus on presenting Zhong et al.'s paper. The core contribution is the application of Zhong et al.'s idea to PG estimation and the use of PPO clipping and KL divergence to prevent overly aggressive updates. While I think it is necessary that the authors devote that space to presenting Zhong et al.'s idea, I think that it might not be clear to the reader that **that is not** the main core idea of the paper. Clarit
Taxonomy
Topics: Reinforcement Learning in Robotics
