Reflective Policy Optimization

Yaozhong Gan; Renye Yan; Zhe Wu; Junliang Xing

arXiv:2406.03678 · cs.LG · June 7, 2024

TL;DR

Reflective Policy Optimization (RPO) extends on-policy reinforcement learning by using both past and future state-action information in each policy update. The authors support the claimed gains in sample efficiency and convergence speed with a monotonic-improvement analysis and benchmark experiments.

Contribution

Introduces RPO, a novel on-policy method that combines past and future information for improved policy optimization and sample efficiency.

Findings

01 RPO guarantees monotonic policy improvement.

02 RPO accelerates convergence in benchmark tasks.

03 RPO achieves superior sample efficiency.

Abstract

On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that amalgamates past and future state-action information for policy optimization. This approach empowers the agent for introspection, allowing modifications to its actions within the current state. Theoretical analysis confirms that RPO monotonically improves policy performance and contracts the solution space, consequently expediting convergence. Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks, culminating in superior sample efficiency. The source code of this work is available at https://github.com/Edgargan/RPO.
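For context on the baseline the abstract names, the following is a minimal NumPy sketch of the standard PPO clipped surrogate objective. It is not the RPO objective and is not taken from the authors' repository; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are illustrative assumptions.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate (to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    new and old policies. advantages: advantage estimates for those actions.
    All names here are hypothetical, chosen for illustration only.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Importance ratio r = pi_new(a|s) / pi_old(a|s).
    ratio = np.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps] to discourage large policy changes.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```

With identical old and new policies the ratio is 1 and the objective reduces to the mean advantage. When the ratio exceeds `1 + clip_eps` on a positive-advantage sample, the clipped term caps the contribution, so pushing the ratio further yields no additional gain.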

Peer Reviews

Decision · ICML 2024 Poster

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 2

Strengths

I believe the authors want to say that their method performs a kind of hindsight credit assignment \citep{harutyunyan2019}, but this is just a hunch; I could not really tell from the text. The paper appears to be backed by theory and supports its superior performance claims with empirical results. Unfortunately, I did not understand the motivation and method well enough to assess these properly. Please see my points of confusion below. I am happy to revise my score if the authors the motivation

Weaknesses

First, I do not understand why the authors believe proximal methods do not already account for the true gradient of the policy which also considers the contribution through the stationary distribution. Indeed, in the original CPI paper \citep{kakade}, the authors first derive the form of the bound that contains the stationary distribution with the current policy, and not a prior policy, then use a mixture policy to ensure that the new policy is “close” to the prior one and justify replacing the
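The mixture-policy idea the reviewer refers to can be sketched as follows. This is a minimal illustration of a conservative mixture update in the style of CPI for a single state with discrete actions, not code from the paper under review; the function names and the example distributions are assumptions made for illustration.

```python
import numpy as np


def conservative_mixture(pi_old, pi_candidate, alpha):
    """Return (1 - alpha) * pi_old + alpha * pi_candidate.

    pi_old, pi_candidate: action distributions for one state (hypothetical
    inputs). alpha in [0, 1] controls how far the update moves.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    pi_old = np.asarray(pi_old, dtype=float)
    pi_candidate = np.asarray(pi_candidate, dtype=float)
    return (1.0 - alpha) * pi_old + alpha * pi_candidate


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum())


pi_old = [0.5, 0.3, 0.2]
pi_candidate = [0.0, 1.0, 0.0]  # e.g. a greedy policy w.r.t. some advantage estimate
pi_new = conservative_mixture(pi_old, pi_candidate, alpha=0.1)
```

Because the update is a convex combination, the total variation distance between `pi_new` and `pi_old` equals `alpha` times the distance between `pi_candidate` and `pi_old`, which is the sense in which a small `alpha` keeps the new policy "close" to the prior one.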

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

Originality: The idea of considering the subsequent state-action pairs in policy optimization sounds interesting and novel. The theory (if correct) can add new insights into policy gradient methods when they are unrolled over a long horizon.

Weaknesses

Quality & significance: **There may be some technical misstatements and errors in the main contributions**. First, the following statement may be erroneous: “If you recombine the above equation $(r_0-1)r_1 A^{\hat{\pi}}(s_1, a_1)]\cdot r_1 >0$, optimizing it will be found to increase the probability of $a_1$ . However, $A_{\hat{\pi}}(s_1 , a_1) < 0$, we should decrease the probability of $a_1$ . This would present a contradiction.” **Optimizing this objective can result in a bit more complicate

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1) For two policies, authors propose to maximize a generalized lower bound that directly takes into account that policy performance is related to the next state-action pair. In particular, they interestingly prove that the optimized policy is reflective through some theory and show that TRPO is a special case of this method, which allows the policy to be monotonically improved. 2) In this paper, authors consider multi-step RL directly from the policy optimization perspective, which is different

Weaknesses

1) The theorems and proofs proposed in the paper are interesting, but the performance shown in the experiments does not seem to be much different from the performance of the baselines. In particular, there are many mentions of convergence speed, but the results shown do not show a significant improvement over existing methods. It is necessary to show the performance improvement in an environment where the subsequent state-action can be better considered. 2) The clipping working environment that

Code & Models

Repositories

edgargan/rpo
PyTorch · Official

Taxonomy

Topics: Distributed and Parallel Computing Systems