Towards Combining On-Off-Policy Methods for Real-World Applications

Kai-Chun Hu; Chen-Huan Pi; Ting Han Wei; I-Chen Wu; Stone Cheng; Yi-Wei Dai; Wei-Yuan Ye

arXiv:1904.10642 · cs.LG · April 25, 2019 · 1 citation

Open Access

TL;DR

This paper introduces a unified formulation for on-policy and off-policy reinforcement learning, enabling more flexible policy updates and demonstrating its effectiveness on simulated and real-world control tasks.

Contribution

It reformulates the policy gradient objective into a perceptron-like loss, unifying on-policy and off-policy methods and enabling off-policy training in reinforcement learning.

Findings

1. The new formulation matches the PPO clipped surrogate objective.

2. The combined method performs well on simulated and real quadrotor tasks.

3. Policies trained with this approach are efficient and suitable for real-time control.

Abstract

In this paper, we point out a fundamental property of the objective in reinforcement learning, with which we can reformulate the policy gradient objective into a perceptron-like loss function, removing the need to distinguish between on-policy and off-policy training. Namely, we posit that it is sufficient to update a policy $\pi$ only for cases that satisfy the condition $A\left(\frac{\pi}{\mu} - 1\right) \leq 0$, where $A$ is the advantage and $\mu$ is another policy. Furthermore, we show via theoretical derivation that a perceptron-like loss function matches the clipped surrogate objective for PPO. With our new formulation, the policies $\pi$ and $\mu$ can be arbitrarily far apart in theory, effectively enabling off-policy training. To examine our derivations, we can combine the on-policy PPO clipped surrogate (which we show to be equivalent to one instance of the new reformulation) with the off-policy IMPALA…
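The link between a clipped surrogate and a margin-style loss can be checked numerically. The sketch below is not taken from the paper: it uses the standard PPO clipped surrogate and a hinge rewriting with margin $\epsilon |A|$ that follows from case analysis on the sign of $A$, and the paper's own perceptron-like loss may be written differently. The function names, the NumPy implementation, and the default `eps=0.2` are assumptions made for illustration. With the margin set to zero, the hinge is active exactly when $A(\frac{\pi}{\mu} - 1) \leq 0$, the condition stated in the abstract.

```python
import numpy as np


def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv)


def hinge_form(ratio, adv, eps=0.2):
    """Same quantity written as a ratio-independent term minus a hinge.

    The hinge max(0, eps*|A| - A*(r - 1)) is zero (no gradient) once
    A*(r - 1) exceeds the margin eps*|A|. This rewriting is an
    illustrative assumption, not necessarily the paper's exact loss.
    """
    margin = eps * np.abs(adv)
    hinge = np.maximum(0.0, margin - adv * (ratio - 1.0))
    return (1.0 + eps * np.sign(adv)) * adv - hinge


def needs_update(ratio, adv):
    """Boolean mask for the abstract's condition A * (pi/mu - 1) <= 0."""
    return adv * (ratio - 1.0) <= 0.0


# Compare the two forms on random ratios pi/mu and advantages.
rng = np.random.default_rng(0)
ratio = rng.uniform(0.2, 3.0, size=1000)
adv = rng.normal(size=1000)
assert np.allclose(ppo_clip_objective(ratio, adv), hinge_form(ratio, adv))
```

Here `ratio` stands for $\frac{\pi}{\mu}$ evaluated per sample. Because the first term of `hinge_form` does not depend on the ratio, only samples inside the margin contribute a gradient with respect to $\pi$.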

Taxonomy

Topics: Real-Time Systems Scheduling · Reinforcement Learning in Robotics · Parallel Computing and Optimization Techniques

Methods: Sigmoid Activation · Tanh Activation · V-trace · Experience Replay · Entropy Regularization · Residual Connection · Gradient Clipping · RMSProp · Max Pooling