# Augment-Reinforce-Merge Policy Gradient for Binary Stochastic Policy

**Authors:** Yunhao Tang, Mingzhang Yin, Mingyuan Zhou

arXiv: 1903.05284 · 2019-03-14

## TL;DR

This paper introduces the Augment-Reinforce-Merge (ARM) policy gradient estimator, an unbiased, low-variance estimator for binary stochastic policies that makes on-policy reinforcement learning more stable and speeds up convergence.

## Contribution

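The paper proposes the ARM policy gradient estimator for tasks with a binary action space. The estimator is an unbiased alternative to baseline-based estimators, comes with theoretical variance-reduction guarantees, and makes policy training more stable.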

## Key findings

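- The ARM policy gradient estimator is unbiased and achieves variance reduction with theoretical guarantees.
- It leads to significantly more stable training of policies parameterized by neural networks.
- It leads to faster convergence of those policies.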

## Abstract

Due to the high variance of policy gradients, on-policy optimization algorithms are plagued by low sample efficiency. In this work, we propose the Augment-Reinforce-Merge (ARM) policy gradient estimator as an unbiased, low-variance alternative to previous baseline estimators on tasks with a binary action space, inspired by the recent ARM gradient estimator for discrete random variable models. We show that the ARM policy gradient estimator achieves variance reduction with theoretical guarantees, and leads to significantly more stable and faster convergence of policies parameterized by neural networks.
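To illustrate the ARM gradient estimator for discrete random variables that the abstract cites as its inspiration, the sketch below applies the single-variable ARM identity to one Bernoulli variable and compares it with plain REINFORCE. It is a toy example, not the paper's policy gradient algorithm: the objective `f`, the logit `phi`, and all function names are assumptions made here for illustration. For `z ~ Bernoulli(sigmoid(phi))`, the identity used is `d/dphi E[f(z)] = E_u[(f(1[u > sigmoid(-phi)]) - f(1[u < sigmoid(phi)])) * (u - 1/2)]` with `u ~ Uniform(0, 1)`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def true_grad(f, phi):
    """Exact d/dphi of E[f(z)] for z ~ Bernoulli(sigmoid(phi))."""
    p = sigmoid(phi)
    return p * (1.0 - p) * (f(1.0) - f(0.0))


def arm_samples(f, phi, n, rng):
    """Per-sample ARM gradient estimates for a single Bernoulli variable.

    Both binary outcomes are derived from the same uniform draw u,
    and their difference in f is weighted by (u - 1/2).
    """
    u = rng.uniform(size=n)
    z_swap = (u > sigmoid(-phi)).astype(float)  # outcome under the sign-flipped logit
    z_main = (u < sigmoid(phi)).astype(float)   # ordinary Bernoulli(sigmoid(phi)) draw
    return (f(z_swap) - f(z_main)) * (u - 0.5)


def reinforce_samples(f, phi, n, rng):
    """Per-sample REINFORCE estimates without a baseline, for comparison."""
    p = sigmoid(phi)
    z = (rng.uniform(size=n) < p).astype(float)
    # d/dphi log Bernoulli(z; sigmoid(phi)) = z - p
    return f(z) * (z - p)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = lambda z: (z - 0.45) ** 2  # hypothetical toy objective, not from the paper
    phi = 0.3                      # hypothetical logit value

    arm = arm_samples(f, phi, 200_000, rng)
    rf = reinforce_samples(f, phi, 200_000, rng)

    print("exact gradient :", true_grad(f, phi))
    print("ARM mean / var :", arm.mean(), arm.var())
    print("REINFORCE mean / var:", rf.mean(), rf.var())
```

Both estimators are unbiased, so their sample means should approach the exact gradient. On this toy objective the ARM samples should show a noticeably smaller variance, which is the kind of behavior the paper carries over to binary-action policies.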

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.05284/full.md

## Figures

42 figures with captions in the complete paper: https://tomesphere.com/paper/1903.05284/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/1903.05284/full.md

---
Source: https://tomesphere.com/paper/1903.05284