Policy Optimization with Second-Order Advantage Information

Jiajin Li; Baoxiang Wang

arXiv:1805.03586 · cs.LG · May 30, 2019 · 1 cites

TL;DR

This paper introduces POSA, a variance-reducing policy gradient estimator that leverages second-order advantage information and action space factorization, improving performance in high-dimensional control tasks.

Contribution

The paper proposes POSA, a novel policy optimization method that incorporates second-order advantage information and action subspace factorization to reduce gradient variance.

Findings

01 Demonstrates improved performance on high-dimensional synthetic tasks.

02 Achieves better results on MuJoCo continuous control benchmarks.

03 Effectively captures quadratic information with a wide & deep architecture.

Abstract

Policy optimization on high-dimensional continuous control tasks is difficult because of the large variance of policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and control variates (CV) into a unified framework to reduce this variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space from second-order advantage information. POSA captures the quadratic information explicitly and efficiently by using a wide & deep architecture. Empirical studies show that the proposed approach improves performance on high-dimensional synthetic settings and on OpenAI Gym's MuJoCo continuous control tasks.
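To make the idea of learning a factorization from second-order advantage information more concrete, here is a minimal sketch. It is not the paper's implementation. It assumes one plausible reading of the abstract: action dimensions whose mixed second derivatives of the advantage are (near) zero do not interact and can be placed in separate subspaces. The toy advantage function, the finite-difference Hessian, the threshold `tol`, and the helper names `numerical_hessian` and `action_subspaces` are all hypothetical and chosen for illustration only.

```python
import numpy as np

def numerical_hessian(f, a, eps=1e-2):
    """Central finite-difference Hessian of a scalar function f at point a."""
    n = a.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_i[i] = eps
            e_j = np.zeros(n)
            e_j[j] = eps
            H[i, j] = (f(a + e_i + e_j) - f(a + e_i - e_j)
                       - f(a - e_i + e_j) + f(a - e_i - e_j)) / (4 * eps ** 2)
    return H

def action_subspaces(H, tol=1e-3):
    """Group action dimensions into connected components of the graph
    that links i and j whenever |H[i, j]| > tol (simple union-find)."""
    n = H.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(H[i, j]) > tol:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical toy advantage: quadratic in the action with a block-diagonal
# interaction matrix, so dims {0, 1} and {2, 3} do not interact.
M = np.array([[-1.0, 0.5, 0.0, 0.0],
              [0.5, -1.0, 0.0, 0.0],
              [0.0, 0.0, -2.0, 0.3],
              [0.0, 0.0, 0.3, -1.5]])

def toy_advantage(a):
    return float(a @ M @ a)

H = numerical_hessian(toy_advantage, np.zeros(4))
subspaces = action_subspaces(H)
print(subspaces)  # [[0, 1], [2, 3]]
```

In this toy case the recovered groups match the block structure of `M`. With a learned advantage model, the Hessian would come from the model rather than from finite differences, and the threshold would need tuning; both choices here are assumptions of the sketch.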

Code & Models

Repositories

wangbx66/Action-Subspace-Dependent
PyTorch · Official

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques