ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm

Hanyong Wang; Menglong Yang

arXiv:2602.09726·cs.LG·February 11, 2026

TL;DR

ExO-PPO is a novel reinforcement learning algorithm that combines the stability of on-policy methods with the sample efficiency of off-policy techniques, leading to improved performance across various tasks.

Contribution

This paper introduces ExO-PPO, a new PPO variant that leverages off-policy data using a segmented exponential clipping mechanism and a replay buffer for enhanced efficiency and stability.

Findings

01 ExO-PPO outperforms PPO and other variants in empirical tests.

02 ExO-PPO achieves better sample efficiency and stability.

03 The method demonstrates versatility across different tasks.

Abstract

Deep reinforcement learning has solved a variety of tasks successfully; however, because of the construction of the policy gradient and the training dynamics, tuning deep reinforcement learning models remains challenging. As one of the most successful deep reinforcement learning algorithms, Proximal Policy Optimization (PPO) clips the policy gradient within conservative on-policy updates, which ensures reliable and stable policy improvement. However, this training pattern may sacrifice sample efficiency. Off-policy methods, on the other hand, make fuller use of data through sample reuse, though at the cost of increased estimation variance and bias. To leverage the advantages of both, we propose a new PPO variant that builds on the stability guarantee of conservative on-policy iteration while using off-policy data more efficiently. Specifically, we…
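
For readers unfamiliar with the clipping that the abstract refers to, the following is a minimal sketch of the clipped surrogate objective of standard PPO, which is the baseline here and not the ExO-PPO objective (the segmented exponential clipping is not specified in this excerpt). The function name `ppo_clip_objective`, its arguments, and the default `eps=0.2` are illustrative assumptions.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective of standard PPO (to be maximized).

    Hypothetical helper for illustration only; this is baseline PPO,
    not the ExO-PPO objective proposed in the paper.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), from log-probabilities.
        ratio = math.exp(ln - lo)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, a ratio above `1 + eps` earns no extra objective value, so the update has no incentive to move the policy further from the one that collected the data. That conservatism is the source of the stability described above, and it is also why samples from older policies are hard to reuse in plain PPO.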

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control