Low-Variance Policy Gradient Estimation with World Models

Michal Nauman; Floris Den Hengst

arXiv:2010.15622·stat.ML·October 30, 2020

TL;DR

This paper introduces WMPG, a novel policy gradient method that leverages learned world models to generate imagined trajectories, reducing variance and improving sample efficiency in reinforcement learning tasks.

Contribution

The paper presents WMPG, a new approach that uses learned world models to estimate policy gradients with lower variance and higher sample efficiency.

Findings

01. WMPG outperforms AC and MAC in sample efficiency across tested environments.

02. Imagined trajectories serve as effective baselines and estimators.

03. WMPG benefits from robust latent environment representations.

Abstract

In this paper, we propose World Model Policy Gradient (WMPG), an approach to reduce the variance of policy gradient estimates using learned world models (WMs). In WMPG, a WM is trained online and used to imagine trajectories. The imagined trajectories are used in two ways: first, to calculate a without-replacement estimator of the policy gradient; second, their return serves as an informed baseline. We compare the proposed approach with AC and MAC on a set of environments of increasing complexity (CartPole, LunarLander and Pong) and find that WMPG has better sample efficiency. Based on these results, we conclude that WMPG can yield increased sample efficiency in cases where a robust latent representation of the environment can be learned.
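To make the "informed baseline" idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a single-state problem with three discrete actions, a softmax policy, and a hypothetical imperfect world model (`model_reward`) whose predictions stand in for the return of imagined trajectories. All names and numbers are illustrative assumptions. It compares the variance of a score-function gradient estimate with and without subtracting the expected imagined return. The without-replacement estimator mentioned in the abstract is not covered here.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(logits, action):
    # Gradient of log softmax(logits)[action] with respect to the logits.
    g = -softmax(logits)
    g[action] += 1.0
    return g

def pg_samples(logits, true_reward, baseline, n, rng):
    # Per-sample score-function gradients: (reward - baseline) * grad log pi(a).
    probs = softmax(logits)
    actions = rng.choice(len(probs), size=n, p=probs)
    rewards = true_reward[actions] + rng.normal(0.0, 0.1, size=n)
    return np.stack([(r - baseline) * grad_log_pi(logits, a)
                     for a, r in zip(actions, rewards)])

rng = np.random.default_rng(0)
logits = np.array([0.2, -0.1, 0.4])

# Hypothetical environment rewards and a hypothetical, slightly wrong world model.
true_reward = np.array([10.0, 11.0, 12.0])
model_reward = true_reward + np.array([0.3, -0.2, 0.1])

# Baseline: expected imagined return under the current policy.
imagined_baseline = float(softmax(logits) @ model_reward)

plain = pg_samples(logits, true_reward, 0.0, 5000, rng)
informed = pg_samples(logits, true_reward, imagined_baseline, 5000, rng)

# Total variance of the per-sample gradient, summed over logit dimensions.
var_plain = plain.var(axis=0).sum()
var_informed = informed.var(axis=0).sum()
print("variance without baseline:", var_plain)
print("variance with imagined baseline:", var_informed)
```

Because the baseline does not depend on the sampled action, subtracting it leaves the expected gradient unchanged. In this toy setting the variance falls even though the model is imperfect, since the baseline only needs to be close to the typical return.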

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Machine Learning and Algorithms