Unified Policy Optimization for Continuous-action Reinforcement Learning in Non-stationary Tasks and Games
Rong-Jun Qin, Fan-Ming Luo, Hong Qian, Yang Yu

TL;DR
This paper introduces PORL, a no-regret reinforcement learning algorithm for continuous actions in non-stationary environments, with a proven last-iterate convergence guarantee and better performance than SAC in dynamic and adversarial settings.
Contribution
The paper proposes PORL, an algorithm built on follow-the-regularized-leader (FTRL) and mirror descent (MD) updates, with a last-iterate convergence guarantee for non-stationary continuous-action tasks.
Findings
PORL matches or exceeds SAC in stationary environments.
PORL outperforms SAC in non-stationary and adversarial environments.
PORL demonstrates stable training and better final policies.
Abstract
This paper addresses policy learning in non-stationary environments and games with continuous actions. Instead of the classical reward-maximization mechanism, we draw on the follow-the-regularized-leader (FTRL) and mirror descent (MD) updates to propose PORL, a no-regret style reinforcement learning algorithm for continuous-action tasks. We prove that PORL has a last-iterate convergence guarantee, which is important for adversarial and cooperative games. Empirical studies show that in stationary environments, such as MuJoCo locomotion control tasks, PORL performs as well as, if not better than, the soft actor-critic (SAC) algorithm. In non-stationary environments, including dynamical environments, adversarial training, and competitive games, PORL is superior to SAC in both final policy performance and training stability.
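The abstract names FTRL and MD as the inspiration for PORL but does not spell out either update. As a rough illustration only, the sketch below runs the textbook entropic mirror descent update on a small discrete action set with a loss sequence that shifts halfway through, and checks that it coincides with FTRL under a negative-entropy regularizer. This is not the PORL algorithm: the discrete action set, the synthetic losses, the step size `eta`, and all function names are assumptions made for the example.

```python
import numpy as np

def md_step(p, loss, eta):
    """One entropic mirror descent step on the probability simplex."""
    logits = np.log(p) - eta * loss
    logits -= logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def ftrl_policy(cum_loss, eta):
    """FTRL with a negative-entropy regularizer: softmax of -eta * cumulative loss."""
    logits = -eta * cum_loss
    logits -= logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def run(losses, eta=0.1):
    """Play mirror descent against a loss sequence; return final policies and regret."""
    n_rounds, n_actions = losses.shape
    p_md = np.full(n_actions, 1.0 / n_actions)  # start from the uniform policy
    cum_loss = np.zeros(n_actions)
    incurred = 0.0
    for t in range(n_rounds):
        incurred += p_md @ losses[t]            # expected loss of the current policy
        p_md = md_step(p_md, losses[t], eta)
        cum_loss += losses[t]
    p_ftrl = ftrl_policy(cum_loss, eta)
    regret = incurred - cum_loss.min()          # regret against the best fixed action
    return p_md, p_ftrl, regret

# Hypothetical non-stationary losses: the best action changes halfway through.
rng = np.random.default_rng(0)
T, K = 2000, 3
losses = rng.uniform(0.0, 1.0, size=(T, K))
losses[: T // 2, 0] -= 0.3   # action 0 is cheapest in the first half
losses[T // 2 :, 2] -= 0.5   # action 2 is cheapest in the second half
losses = np.clip(losses, 0.0, 1.0)

p_md, p_ftrl, regret = run(losses)
print("final MD policy:  ", np.round(p_md, 4))
print("final FTRL policy:", np.round(p_ftrl, 4))
print("average regret:   ", regret / T)
```

With linear losses and a uniform starting policy, the two updates produce the same iterate, which is why the literature often treats them together. How PORL carries these ideas over to continuous actions and neural policies is described in the paper itself and is not reproduced here.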
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Global Average Pooling · Dilated Convolution · Convolution · 1x1 Convolution · Average Pooling · Switchable Atrous Convolution
