AM-PPO: (Advantage) Alpha-Modulation with Proximal Policy Optimization
Soham Sane

TL;DR
AM-PPO introduces an adaptive advantage modulation mechanism to improve the stability and performance of PPO in reinforcement learning by dynamically scaling advantage estimates based on their statistical properties.
Contribution
The paper proposes a novel advantage modulation technique with an alpha controller for PPO, enhancing stability and learning efficiency in reinforcement learning.
Findings
Achieves superior reward trajectories on continuous control benchmarks.
Reduces the amount of clipping required by adaptive optimizers.
Demonstrates improved stability and sustained learning progression.
Abstract
Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm that heavily relies on accurate advantage estimates for stable and efficient training. However, raw advantage signals can exhibit significant variance, noise, and scale-related issues, impeding optimal learning performance. To address this challenge, we introduce Advantage Modulation PPO (AM-PPO), a novel enhancement of PPO that adaptively modulates advantage estimates using a dynamic, non-linear scaling mechanism. This adaptive modulation employs an alpha controller that dynamically adjusts the scaling factor based on evolving statistical properties of the advantage signals, such as their norm, variance, and a predefined target saturation level. By incorporating a tanh-based gating function driven by these adaptively scaled advantages, AM-PPO reshapes the advantage signals to stabilize gradient updates…
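To make the mechanism in the abstract more concrete, here is a minimal sketch of advantage modulation with an alpha controller and a tanh gate. The abstract is truncated and gives no update rule, so everything specific here is an assumption for illustration rather than the paper's method: the class name `AlphaController`, the RMS normalization, the saturation measure (mean absolute gate output), the multiplicative update, and the parameters `target_saturation` and `lr`.

```python
import numpy as np


class AlphaController:
    """Hypothetical alpha controller; the update rule is an assumption."""

    def __init__(self, target_saturation=0.6, lr=0.1, alpha=1.0, eps=1e-8):
        self.target_saturation = target_saturation  # desired mean |tanh| output
        self.lr = lr                                # controller step size
        self.alpha = alpha                          # current scaling factor
        self.eps = eps                              # guards against division by zero

    def modulate(self, advantages):
        adv = np.asarray(advantages, dtype=np.float64)
        # Normalize by the RMS of the batch so alpha is scale-independent.
        rms = np.sqrt(np.mean(adv ** 2)) + self.eps
        # Non-linear gate: bounded in (-1, 1) and sign-preserving.
        gated = np.tanh(self.alpha * adv / rms)
        # Observed saturation: how close the gate outputs sit to +/-1 on average.
        saturation = np.mean(np.abs(gated))
        # Raise alpha when under the target, lower it when over the target.
        self.alpha *= np.exp(self.lr * (self.target_saturation - saturation))
        return gated


if __name__ == "__main__":
    controller = AlphaController()
    raw = np.array([0.5, -1.5, 4.0, -0.2])
    print(controller.modulate(raw), controller.alpha)
```

In a PPO loop, under these assumptions, the modulated values would stand in for the raw advantage estimates in the surrogate objective, with `modulate` called once per batch so that alpha tracks the evolving advantage statistics.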
Taxonomy
Topics: Semiconductor materials and devices · Radiation Effects in Electronics
Methods: Entropy Regularization · Proximal Policy Optimization
