Towards Combining On-Off-Policy Methods for Real-World Applications
Kai-Chun Hu, Chen-Huan Pi, Ting Han Wei, I-Chen Wu, Stone Cheng, Yi-Wei Dai, Wei-Yuan Ye

TL;DR
This paper introduces a unified formulation for on-policy and off-policy reinforcement learning, enabling more flexible policy updates and demonstrating its effectiveness on simulated and real-world control tasks.
Contribution
It reformulates the policy gradient objective as a perceptron-like loss, which removes the need to distinguish between on-policy and off-policy training and, in theory, lets the two policies involved be arbitrarily far apart.
Findings
The new formulation matches the PPO clipped surrogate objective.
The combined method performs well on simulated and real quadrotor tasks.
Policies trained with this approach are efficient and suitable for real-time control.
Abstract
In this paper, we point out a fundamental property of the objective in reinforcement learning, with which we can reformulate the policy gradient objective into a perceptron-like loss function, removing the need to distinguish between on-policy and off-policy training. Namely, we posit that it is sufficient to only update a policy $\pi$ for cases that satisfy the condition $A\left(\frac{\pi}{\mu} - 1\right) \leq 0$, where $A$ is the advantage and $\mu$ is another policy. Furthermore, we show via theoretical derivation that a perceptron-like loss function matches the clipped surrogate objective for PPO. With our new formulation, the policies $\pi$ and $\mu$ can be arbitrarily far apart in theory, effectively enabling off-policy training. To examine our derivations, we can combine the on-policy PPO clipped surrogate (which we show to be equivalent to one instance of the new reformulation) with the off-policy IMPALA…
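The link between a perceptron-like loss and the PPO clipped surrogate can be illustrated with a small numerical sketch. It is not the authors' derivation or implementation. It assumes the update condition takes the form A(r - 1) <= 0 with probability ratio r = π/μ, and it uses the standard per-sample PPO objective min(rA, clip(r, 1 - ε, 1 + ε)A). The function names, the default ε = 0.2, and the sampling ranges are all hypothetical choices made for illustration. The sketch checks that the clipped surrogate equals a hinge-style expression, A + ε|A| - max(0, ε|A| - A(r - 1)), whose gradient with respect to r vanishes once A(r - 1) exceeds the margin ε|A|.

```python
import random


def ppo_clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def hinge_form(ratio, advantage, eps=0.2):
    """Hinge-style rewrite: A + eps*|A| - max(0, eps*|A| - A*(r - 1))."""
    margin = eps * abs(advantage)
    return advantage + margin - max(0.0, margin - advantage * (ratio - 1.0))


def should_update(ratio, advantage, margin=0.0):
    """Perceptron-style test A*(r - 1) <= margin.

    margin=0.0 is the assumed form of the abstract's condition;
    margin=eps*|A| is where the clipped surrogate stops giving gradient.
    """
    return advantage * (ratio - 1.0) <= margin


def max_abs_gap(n=10000, seed=0, eps=0.2):
    """Largest difference between the two forms over random (r, A) samples."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(n):
        r = rng.uniform(0.0, 3.0)   # hypothetical range of probability ratios
        a = rng.uniform(-2.0, 2.0)  # hypothetical range of advantages
        diff = abs(ppo_clipped_surrogate(r, a, eps) - hinge_form(r, a, eps))
        gap = max(gap, diff)
    return gap


if __name__ == "__main__":
    print("max gap between PPO clip and hinge form:", max_abs_gap())
    print("update when A=1.0, r=0.8:", should_update(0.8, 1.0))
    print("update when A=1.0, r=1.5:", should_update(1.5, 1.0))
```

Under these assumptions the two forms agree up to floating-point error. A sample with positive advantage stops contributing once its ratio has moved far enough above 1, which is the mistake-driven behaviour of a perceptron update.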
Taxonomy
Topics: Real-Time Systems Scheduling · Reinforcement Learning in Robotics · Parallel Computing and Optimization Techniques
Methods: Sigmoid Activation · Tanh Activation · V-trace · Experience Replay · Entropy Regularization · Residual Connection · Gradient Clipping · RMSProp · Max Pooling
