Low-Variance Policy Gradient Estimation with World Models
Michal Nauman, Floris Den Hengst

TL;DR
This paper introduces WMPG, a novel policy gradient method that leverages learned world models to generate imagined trajectories, reducing variance and improving sample efficiency in reinforcement learning tasks.
Contribution
The paper presents WMPG, which trains a world model online and uses its imagined trajectories in two ways: to compute a without-replacement estimator of the policy gradient, and to supply an informed baseline from the imagined returns.
Findings
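WMPG is more sample-efficient than AC and MAC on CartPole, LunarLander and Pong.
Imagined trajectories are used both to compute a without-replacement policy gradient estimator and, through their returns, as an informed baseline.
The gains in sample efficiency depend on being able to learn a robust latent representation of the environment.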
Abstract
In this paper, we propose World Model Policy Gradient (WMPG), an approach to reduce the variance of policy gradient estimates using learned world models (WMs). In WMPG, a WM is trained online and used to imagine trajectories. The imagined trajectories are used in two ways: first, to calculate a without-replacement estimator of the policy gradient; second, their returns serve as an informed baseline. We compare the proposed approach with AC and MAC on a set of environments of increasing complexity (CartPole, LunarLander and Pong) and find that WMPG has better sample efficiency. Based on these results, we conclude that WMPG can yield increased sample efficiency in cases where a robust latent representation of the environment can be learned.
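The baseline idea in the abstract can be illustrated with a minimal sketch. This is not the WMPG implementation: it assumes a toy single-step problem with three actions, a softmax policy, and a hypothetical `model_means` table standing in for a learned world model. It only shows why subtracting the average return of independently imagined rollouts lowers the variance of a score-function gradient estimate without changing its expectation. The without-replacement estimator is not covered here.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    # Gradient of log softmax(logits)[action] w.r.t. the logits.
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def imagined_baseline(probs, model_means, rng, n_imagined=16):
    # Average return of rollouts imagined with the (hypothetical) model.
    # Sampled independently of the real action, so the estimate stays unbiased.
    actions = rng.choices(range(len(probs)), weights=probs, k=n_imagined)
    return sum(model_means[a] for a in actions) / n_imagined


def sample_gradient(probs, true_means, baseline, rng):
    # One score-function (REINFORCE-style) sample from a real interaction.
    action = rng.choices(range(len(probs)), weights=probs)[0]
    ret = rng.gauss(true_means[action], 1.0)  # noisy real return
    return [g * (ret - baseline) for g in grad_log_prob(probs, action)]


def mean_and_variance(samples):
    n, dim = len(samples), len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(dim)]
    var = sum((s[i] - means[i]) ** 2 for s in samples for i in range(dim)) / n
    return means, var


rng = random.Random(0)
probs = softmax([0.0, 0.0, 0.0])
true_means = [10.0, 11.0, 12.0]   # assumed real expected returns per action
model_means = [9.5, 11.2, 12.3]   # assumed imperfect world-model predictions

plain = [sample_gradient(probs, true_means, 0.0, rng) for _ in range(5000)]
with_baseline = [
    sample_gradient(probs, true_means,
                    imagined_baseline(probs, model_means, rng), rng)
    for _ in range(5000)
]

mean_plain, var_plain = mean_and_variance(plain)
mean_base, var_base = mean_and_variance(with_baseline)
print("variance without baseline:", var_plain)
print("variance with imagined baseline:", var_base)
```

In this toy setting the model does not need to be exact: the baseline only has to be close to the typical return and independent of the sampled real action. A poor model weakens the variance reduction but does not bias the estimate.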
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Machine Learning and Algorithms
