TL;DR
The paper introduces REPPO, an on-policy policy optimization algorithm that trains Q-value models purely from on-policy trajectories so that low-variance pathwise policy gradients can be used in place of score-function updates. The authors report better performance and robustness than existing methods.
Contribution
It presents a novel on-policy algorithm that uses pathwise policy gradients with on-policy Q-value training, enhancing stability and efficiency.
Findings
REPPO outperforms state-of-the-art methods on benchmark tasks.
It demonstrates superior sample efficiency and a reduced memory footprint.
The algorithm shows robustness to hyperparameter variations.
Abstract
Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e. computing a derivative by differentiating the objective function, alleviates the variance issues. However, they require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The…
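The variance gap between the two estimator families mentioned in the abstract can be illustrated on a toy problem. The sketch below is not REPPO or any code from the paper: it assumes a one-dimensional Gaussian policy with mean `mu` and a hypothetical, analytically known action-value function `q_value` (in practice this would be a learned Q-model). The names `q_value`, `dq_da`, `score_function_grad`, `pathwise_grad`, and `compare` are made up for illustration. Both estimators target the same gradient of the expected value with respect to `mu`; they differ only in how noisy a single-sample estimate is.

```python
import random
import statistics


def q_value(a):
    # Hypothetical, known action-value function with its maximum at a = 2.
    return -(a - 2.0) ** 2


def dq_da(a):
    # Analytic derivative of q_value with respect to the action.
    return -2.0 * (a - 2.0)


def score_function_grad(mu, sigma, rng):
    # REINFORCE-style single-sample estimate:
    # Q(a) * d/dmu log N(a; mu, sigma^2) = Q(a) * (a - mu) / sigma^2.
    a = rng.gauss(mu, sigma)
    return q_value(a) * (a - mu) / sigma ** 2


def pathwise_grad(mu, sigma, rng):
    # Reparameterized single-sample estimate: a = mu + sigma * eps,
    # so da/dmu = 1 and the gradient is dQ/da evaluated at the sampled action.
    eps = rng.gauss(0.0, 1.0)
    a = mu + sigma * eps
    return dq_da(a)


def compare(mu=0.0, sigma=1.0, n=20000, seed=0):
    # Draw n single-sample estimates from each estimator and summarize them.
    rng = random.Random(seed)
    sf = [score_function_grad(mu, sigma, rng) for _ in range(n)]
    pw = [pathwise_grad(mu, sigma, rng) for _ in range(n)]
    return {
        "sf_mean": statistics.fmean(sf),
        "sf_var": statistics.variance(sf),
        "pw_mean": statistics.fmean(pw),
        "pw_var": statistics.variance(pw),
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.3f}")
```

For this toy setup the exact gradient at `mu = 0`, `sigma = 1` is 4, and both sample means should land near it. The pathwise estimate has a per-sample variance of 4, while the score-function estimate has a variance of roughly 87. The pathwise route depends on `dq_da` being accurate, which mirrors the abstract's point that it needs a well-trained action-conditioned value function.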