Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search

Lars Buesing; Theophane Weber; Yori Zwols; Sebastien Racaniere; Arthur Guez; Jean-Baptiste Lespiau; Nicolas Heess

arXiv:1811.06272 · cs.LG · November 16, 2018 · 41 cites

TL;DR

This paper introduces CF-GPS, a counterfactually-guided policy search method that uses structural causal models to improve policy learning from logged data in POMDPs, addressing biases in model-based RL.

Contribution

The paper presents CF-GPS, a novel algorithm that leverages counterfactual reasoning with structural causal models to enhance policy search from off-policy data.

Findings

01 CF-GPS improves policy evaluation and search in complex environments.

02 It outperforms vanilla model-based RL algorithms on a grid-world task.

03 CF-GPS generalizes Guided Policy Search and relates to reparameterization methods.

Abstract

Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use…
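
The idea of evaluating a different policy on an individual logged episode can be illustrated with a toy structural causal model. The sketch below is not the paper's implementation: it assumes a fully observed, one-dimensional state and a hypothetical additive-noise transition, so the noise behind a logged episode can be recovered exactly. In a POMDP the noise would instead be inferred as a posterior distribution. All names (`step`, `rollout`, `abduct_noise`) are made up for illustration. The three stages follow the standard counterfactual recipe: infer the noise from the log, swap in the new policy, and replay the episode under the same noise.

```python
import random


def step(state, action, noise):
    # Hypothetical additive-noise transition; an assumption, not from the paper.
    return state + action + noise


def rollout(policy, s0, noises):
    # Replay the SCM forward under a policy with a fixed noise sequence.
    states = [s0]
    for u in noises:
        states.append(step(states[-1], policy(states[-1]), u))
    return states


def abduct_noise(states, actions):
    # Abduction: invert the transition to recover the noise of a logged episode.
    return [s_next - s - a for s, a, s_next in zip(states, actions, states[1:])]


def episode_return(states):
    # Toy objective: stay close to the origin.
    return -sum(abs(s) for s in states[1:])


rng = random.Random(0)
true_noise = [rng.choice([-1, 0, 1]) for _ in range(5)]


def behavior_policy(state):
    # Logging policy: never acts.
    return 0


def target_policy(state):
    # Counterfactual policy: push the state back to the origin.
    return -state


# Logged, real experience generated under the behavior policy.
logged_states = rollout(behavior_policy, 0, true_noise)
logged_actions = [behavior_policy(s) for s in logged_states[:-1]]

# Abduction, intervention, prediction.
inferred_noise = abduct_noise(logged_states, logged_actions)
cf_states = rollout(target_policy, 0, inferred_noise)

print("logged return:        ", episode_return(logged_states))
print("counterfactual return:", episode_return(cf_states))
```

Because the counterfactual rollout reuses the noise of the logged episode, the two returns differ only through the change of policy, not through a fresh draw of the environment's randomness.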

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms