Policy Optimization with Second-Order Advantage Information
Jiajin Li, Baoxiang Wang

TL;DR
This paper introduces POSA, a variance-reducing policy gradient estimator that leverages second-order advantage information and action space factorization, improving performance in high-dimensional control tasks.
Contribution
The paper proposes POSA, a novel policy optimization method that incorporates second-order advantage information and action subspace factorization to reduce gradient variance.
Findings
Demonstrates improved performance on high-dimensional synthetic tasks.
Achieves better results on MuJoCo continuous control benchmarks.
Effectively captures quadratic information with a wide & deep architecture.
Abstract
Policy optimization on high-dimensional continuous control tasks is difficult because policy gradient estimators have large variance. We present the action subspace dependent gradient (ASDG) estimator, which combines the Rao-Blackwell theorem (RB) and control variates (CV) in a unified framework to reduce this variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space from second-order advantage information. POSA captures the quadratic information explicitly and efficiently by using a wide & deep architecture. Empirical studies show that our approach improves performance on high-dimensional synthetic settings and on OpenAI Gym's MuJoCo continuous control tasks.
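To picture how second-order advantage information could expose a factorization of the action space, here is a minimal sketch. It is not the authors' implementation. It assumes a Hessian of the advantage with respect to the action is already available. Two action dimensions are treated as interacting when their cross term exceeds a tolerance, and the subspaces are the connected components of that interaction graph. The function name `action_subspaces`, the `tol` parameter, and the toy advantage are hypothetical, chosen only for illustration.

```python
import numpy as np


def action_subspaces(hessian, tol=1e-6):
    """Group action dimensions into non-interacting subspaces.

    Illustrative assumption: dimensions i and j interact when
    |H[i, j]| > tol, where H is the Hessian of the advantage with
    respect to the action. Groups are connected components of that graph.
    """
    h = np.asarray(hessian, dtype=float)
    n = h.shape[0]
    linked = np.abs(h) > tol
    linked = linked | linked.T  # treat the interaction graph as undirected

    seen = [False] * n
    groups = []
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first search collects every dimension reachable from `start`.
        stack = [start]
        seen[start] = True
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in np.flatnonzero(linked[i]):
                j = int(j)
                if not seen[j]:
                    seen[j] = True
                    stack.append(j)
        groups.append(sorted(component))
    return groups


# Toy advantage A(a) = a0*a1 + a2**2 + a3*a4, whose Hessian is constant.
toy_hessian = np.array(
    [
        [0.0, 1.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 2.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 1.0, 0.0],
    ]
)
groups = action_subspaces(toy_hessian)
print(groups)  # [[0, 1], [2], [3, 4]]
```

In this toy case the five action dimensions split into three blocks. A Rao-Blackwellized estimator could then condition on each block separately, under the assumption that the blocks really are independent in the advantage.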
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
