Reparameterization Proximal Policy Optimization

Hai Zhong; Xun Wang; Zhuoran Li; Longbo Huang

arXiv:2508.06214·cs.LG·February 9, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Reparameterization Proximal Policy Optimization (RPO), a novel method that enhances sample efficiency and stability in reinforcement learning by combining differentiable dynamics with PPO-style updates.

Contribution

RPO unifies RPG and PPO frameworks, incorporating sample reuse, a clipped policy gradient, and KL regularization to improve stability and efficiency.
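
As a rough illustration of two of the ingredients named above, the sketch below computes the standard PPO clipped surrogate and a KL penalty between diagonal Gaussian policies. This is the generic PPO form, not RPO's RPG-tailored clipping, which this page does not spell out. The function names and the `eps` and `beta` values are illustrative assumptions.

```python
import math

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Generic PPO clipped surrogate, averaged over samples (assumed form)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(1.0 + eps, r))  # clip ratio to [1-eps, 1+eps]
        total += min(r * a, clipped_r * a)             # pessimistic (lower) bound
    return total / len(ratios)

def gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(old || new) for diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for m0, s0, m1, s1 in zip(mu_old, std_old, mu_new, std_new):
        kl += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return kl

def regularized_objective(ratios, advantages, kl, eps=0.2, beta=0.1):
    """Clipped surrogate minus a KL penalty; beta is a hypothetical coefficient."""
    return clipped_surrogate(ratios, advantages, eps) - beta * kl
```

With a positive advantage, a ratio above `1 + eps` earns no extra objective value, which is what limits how far a reused sample can push the policy.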

Findings

01. RPO achieves superior sample efficiency across tasks.

02. RPO outperforms existing methods in diverse environments.

03. The framework maintains stability through KL regularization.

Abstract

By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive dynamics Jacobians and inherent training instability. While sample reuse offers a remedy for under-utilization, no prior principled framework exists, and naive attempts risk exacerbating instability. To address these challenges, we propose Reparameterization Proximal Policy Optimization (RPO). We first establish that under sample reuse, RPG naturally optimizes a PPO-style surrogate objective via Backpropagation Through Time, providing a unified framework for both on- and off-policy updates. To further ensure stability, RPO integrates a clipped policy gradient mechanism tailored for RPG and employs explicit Kullback-Leibler divergence regularization.…
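
To make "reparameterized gradients via Backpropagation Through Time" concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration and not taken from the paper: the scalar linear dynamics, the linear policy `theta * s`, the quadratic reward, and the fixed noise draws. The backward loop accumulates the return's gradient through the dynamics Jacobian, which is `1 + theta` in this toy case.

```python
def rollout(theta, s0, noise, sigma=0.1):
    """Roll out toy dynamics s' = s + a with reparameterized actions."""
    states = [s0]
    for eps in noise:
        action = theta * states[-1] + sigma * eps  # a = mean(s) + sigma * eps
        states.append(states[-1] + action)         # differentiable dynamics step
    ret = -sum(s * s for s in states[1:])          # reward r_t = -s_{t+1}^2
    return states, ret

def bptt_grad(theta, s0, noise, sigma=0.1):
    """d(return)/d(theta) by manual reverse-mode BPTT on the toy rollout."""
    states, _ = rollout(theta, s0, noise, sigma)
    grad, adj = 0.0, 0.0  # adj holds d(return)/d(s_{t+2}) when entering step t
    for t in reversed(range(len(noise))):
        # Total derivative w.r.t. s_{t+1}: direct reward term plus future terms
        # propagated through the dynamics Jacobian ds_{t+2}/ds_{t+1} = 1 + theta.
        adj = -2.0 * states[t + 1] + adj * (1.0 + theta)
        grad += adj * states[t]  # direct ds_{t+1}/dtheta = s_t
    return grad
```

Because the noise draws are held fixed, the return is a deterministic function of `theta`, so the result can be checked against a central finite difference of `rollout(...)[1]`.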

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Replacing the policy-gradient-based objective of RPG with a PPO objective is interesting in terms of sample reuse. 2. Using the old backpropagation through time (BPTT) trick to train the RPG model has not been done before.

Weaknesses

1. The primary contribution of this work lies in the reuse of trajectory samples. Given this, a comparison with off-policy evaluation and multi-step Q-learning methods is warranted, as they similarly grapple with the risk of high variance or inaccurate estimation caused by products of importance ratios. Although the authors employ a clipping trick to mitigate numerical instability, this approach comes at the cost of sample effectiveness, potentially preventing a significant portion of trajectorie

Reviewer 02 · Rating 4 · Confidence 5

Strengths

Generally, I like the idea of the paper. Using PPO's surrogate-objective idea to update the computation graph and reuse previous samples is smart. In addition, the paper explores an important problem, stabilizing reparameterization-based RL in differentiable simulators, and provides an algorithm that is simple, general, and compatible with existing frameworks. The empirical results on DFlex and Rewarped demonstrate that the method can improve stability and efficiency across multiple

Weaknesses

Though the idea of the paper looks good, the writing is not very good in my view. Clarity of the presentation: I found the paper hard to follow and understand. The derivation of the formulas is not very clear or solid and needs substantial polishing. 1. Eq. (8) and the following text are confusing to me. They use $A(s,a)$ in Eq. (8), and then in line 283 they use $\nabla R(\tau)$. It would greatly increase readability if Eq. (8) could be written in the form of $R(\tau)$. 2. In Eq. (3), the exp

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1) The paper is well written and the ideas are presented clearly, making it accessible and easy to follow. 2) The experimental section provides sufficient verification on multiple tasks with well-chosen baseline methods, clearly demonstrating the advantages of RPO in terms of sample efficiency and training stability. 3) The ablation study is well organized and separately describes the contributions of the three algorithmic improvements: KL regularization, sample reuse, and the clipping mechanism.

Weaknesses

1) Introducing the clipping and KL divergence penalty operations from PPO into model-based methods is a relatively intuitive idea, and it is not novel as an algorithmic improvement. If a theoretical analysis could be provided to prove that the improved training algorithm leads to better final performance and more stable training, the contributions of this paper would be enriched. 2) During the design of the PPO algorithm, it was observed that there is a certain degree of redundancy betwee

Taxonomy

Topics: Stochastic Gradient Optimization Techniques