Joint action loss for proximal policy optimization
Xiulei Song, Yizhao Jin, Greg Slabaugh, Simon Lucas

TL;DR
This paper introduces a joint action loss for PPO that computes the clipped loss separately for each sub-action and combines it with the usual joint-probability loss, improving sample efficiency and performance in environments with compound actions.
Contribution
It proposes a multi-action mixed loss that combines joint and per-sub-action probabilities, so that fewer samples with compound actions have their gradients zeroed by clipping, with reported performance gains over standard PPO.
Findings
Over 50% performance improvement in MuJoCo environments.
Sub-action loss outperforms standard PPO in Gym-μRTS.
Better balance of sample efficiency and action quality.
Abstract
PPO (Proximal Policy Optimization) is a state-of-the-art policy gradient algorithm that has been successfully applied to complex computer games such as Dota 2 and Honor of Kings. In these environments, an agent makes compound actions consisting of multiple sub-actions. PPO uses clipping to restrict policy updates. Although clipping is simple and effective, it is not efficient in its sample use. For compound actions, most PPO implementations consider the joint probability (density) of sub-actions, which means that if the ratio of a sample (a state and compound-action pair) exceeds the clipping range, the gradient the sample produces is zero. Instead, for each sub-action we calculate the loss separately, which is less prone to clipping during updates, thereby making better use of samples. Further, we propose a multi-action mixed loss that combines joint and separate probabilities. We perform experiments…
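The clipping argument in the abstract can be made concrete with a small numeric sketch. The code below is illustrative only and is not taken from the paper: the function names, the averaging over sub-actions, and the `mix` weight used to combine the two terms are all assumptions, since the abstract does not give the exact form of the mixed loss. It shows a sample whose joint probability ratio falls outside the clipping range (so it contributes no gradient), while each sub-action ratio on its own stays inside the range.

```python
import math


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for a single probability ratio."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def gradient_is_zero(ratio, advantage, eps=0.2):
    """True when the clipped branch is active, so the sample yields no gradient."""
    return (advantage > 0 and ratio > 1.0 + eps) or (
        advantage < 0 and ratio < 1.0 - eps
    )


def joint_objective(new_logps, old_logps, advantage, eps=0.2):
    """Clip once, on the ratio of joint probabilities (sum of log-probs)."""
    ratio = math.exp(sum(new_logps) - sum(old_logps))
    return clipped_surrogate(ratio, advantage, eps)


def sub_action_objective(new_logps, old_logps, advantage, eps=0.2):
    """Clip each sub-action ratio separately.

    ASSUMPTION: the per-sub-action terms are averaged; the paper may sum
    or weight them differently.
    """
    terms = [
        clipped_surrogate(math.exp(n - o), advantage, eps)
        for n, o in zip(new_logps, old_logps)
    ]
    return sum(terms) / len(terms)


def mixed_objective(new_logps, old_logps, advantage, eps=0.2, mix=0.5):
    """ASSUMPTION: a convex combination with a hypothetical weight `mix`."""
    joint = joint_objective(new_logps, old_logps, advantage, eps)
    separate = sub_action_objective(new_logps, old_logps, advantage, eps)
    return mix * joint + (1.0 - mix) * separate


if __name__ == "__main__":
    # Two sub-actions; each probability rises from 0.5 to 0.575 (ratio 1.15).
    old_logps = [math.log(0.5), math.log(0.5)]
    new_logps = [math.log(0.575), math.log(0.575)]
    advantage = 1.0

    joint_ratio = math.exp(sum(new_logps) - sum(old_logps))  # about 1.3225
    sub_ratios = [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

    print("joint sample clipped:", gradient_is_zero(joint_ratio, advantage))
    print("sub-actions clipped:", [gradient_is_zero(r, advantage) for r in sub_ratios])
    print("mixed objective:", mixed_objective(new_logps, old_logps, advantage))
```

With a clipping range of 0.2, the joint ratio of roughly 1.32 exceeds 1.2 and is clipped, whereas each sub-action ratio of roughly 1.15 is not. That is the sense in which separate losses make better use of the same sample.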
Taxonomy
Topics: Artificial Intelligence in Games · Reinforcement Learning in Robotics · Topic Modeling
Methods: Entropy Regularization · Proximal Policy Optimization · Contrastive Language-Image Pre-training
