Relative Entropy Pathwise Policy Optimization

Claas Voelcker; Axel Brunnbauer; Marcel Hussing; Michal Nauman; Pieter Abbeel; Eric Eaton; Radu Grosu; Amir-massoud Farahmand; Igor Gilitschenski

arXiv:2507.11019 · cs.LG · April 14, 2026

TL;DR

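The paper introduces REPPO, an on-policy policy optimization algorithm that trains Q-value models purely from on-policy trajectories, so that low-variance pathwise policy gradients can be used in place of score-function updates. It reports better performance, sample efficiency, and hyperparameter robustness than existing methods.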

Contribution

It presents an on-policy algorithm that pairs pathwise policy gradients with Q-value models trained purely on on-policy trajectories, and that combines stochastic policies for exploration with constrained updates for stable training.

Findings

01. REPPO outperforms state-of-the-art methods on benchmark tasks.

02. It demonstrates superior sample efficiency and a reduced memory footprint.

03. The algorithm is robust to hyperparameter variations.

Abstract

Score-function-based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e., computing a derivative by differentiating the objective function, alleviates the variance issues. However, pathwise gradients require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The…
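The variance contrast described in the abstract can be illustrated with a toy one-dimensional problem. The sketch below is not the REPPO algorithm: it assumes a known, differentiable quadratic Q-function (the hypothetical `q_value`), whereas the paper learns Q from on-policy data, and it assumes a Gaussian policy with a fixed standard deviation. Both estimators target the same gradient of the expected Q-value with respect to the policy mean; only the way each sample is formed differs.

```python
import random
import statistics


def q_value(action):
    # Hypothetical, known Q-function with its maximum at action = 2.
    return -(action - 2.0) ** 2


def q_grad(action):
    # Analytic derivative of q_value with respect to the action.
    return -2.0 * (action - 2.0)


def gradient_samples(mu, sigma, n, seed=0):
    """Return per-sample score-function and pathwise gradient estimates."""
    rng = random.Random(seed)
    score, pathwise = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        action = mu + sigma * eps  # reparameterized draw from N(mu, sigma^2)
        # Score-function estimate: Q(a) * d/dmu log N(a; mu, sigma^2).
        score.append(q_value(action) * (action - mu) / sigma ** 2)
        # Pathwise estimate: dQ/da * da/dmu, where da/dmu = 1.
        pathwise.append(q_grad(action))
    return score, pathwise


mu, sigma = 0.0, 1.0
score, pathwise = gradient_samples(mu, sigma, n=20000)

# For this quadratic Q, d/dmu E[Q(a)] = -2 * (mu - 2).
true_grad = -2.0 * (mu - 2.0)
score_mean = statistics.fmean(score)
path_mean = statistics.fmean(pathwise)
score_var = statistics.pvariance(score)
path_var = statistics.pvariance(pathwise)

print(f"true gradient:  {true_grad:.3f}")
print(f"score-function: mean={score_mean:.3f} var={score_var:.3f}")
print(f"pathwise:       mean={path_mean:.3f} var={path_var:.3f}")
```

In this toy setting both estimators average to roughly the true gradient, but the per-sample variance of the pathwise estimate is markedly lower. That advantage depends on having an accurate derivative of Q with respect to the action, which is the requirement the abstract identifies as hard to meet without a replay buffer.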

Videos

Relative Entropy Pathwise Policy Optimization · SlidesLive