Generalised Policy Improvement with Geometric Policy Composition

Shantanu Thakoor; Mark Rowland; Diana Borsa; Will Dabney; Rémi Munos; André Barreto

arXiv:2206.08736·stat.ML·June 20, 2022·1 cites

Open Access

TL;DR

This paper introduces a policy improvement method that uses geometric horizon models (GHMs) to interpolate between value-based and model-based RL, enabling effective policy composition and transfer.

Contribution

It presents a new approach for evaluating and composing non-Markov policies using GHMs, with theoretical analysis and empirical validation in deep RL tasks.

Findings

01. The method outperforms standard GPI in continuous control tasks.

02. Theoretical convergence guarantees for GHM training methods.

03. Stable training procedures for deep RL applications.

Abstract

We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its…
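The composition idea in the abstract can be illustrated with a small tabular sketch. It is not the paper's implementation, and every name in it (`ghm`, `compose`, `switching_ghm_direct`, the matrices `P1` and `P2`) is hypothetical. It assumes two base policies whose state-to-state transition matrices are `P1` and `P2`, and a switching policy that starts with policy 1 and, after every step, switches permanently to policy 2 with probability `alpha`. Under those assumptions, with `beta = gamma * (1 - alpha)`, the GHM of the switching policy should equal a weighted mix of the short-horizon GHM of policy 1 and its product with the long-horizon GHM of policy 2. The code checks this numerically against a direct computation on the chain over (state, mode) pairs. Plain Python lists and a truncated series are used only to keep the sketch dependency-free.

```python
# Tabular sketch: compose two geometric horizon models (GHMs) to evaluate a
# switching policy without further learning. Names are illustrative only.

def matmul(A, B):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def combine(w1, A, w2, B):
    """Entrywise w1 * A + w2 * B."""
    return [[w1 * a + w2 * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def ghm(P, gamma, n_terms=500):
    """Truncated series (1 - gamma) * sum_t gamma^t * P^(t+1)."""
    n = len(P)
    total = [[0.0] * n for _ in range(n)]
    power, weight = P, 1.0 - gamma
    for _ in range(n_terms):
        total = combine(1.0, total, weight, power)
        power = matmul(power, P)
        weight *= gamma
    return total

def compose(P1, P2, gamma, alpha):
    """GHM of the switching policy, built only from the base-policy GHMs."""
    beta = gamma * (1.0 - alpha)   # effective discount while policy 1 is active
    M1 = ghm(P1, beta)             # short-horizon model of policy 1
    M2 = ghm(P2, gamma)            # long-horizon model of policy 2
    return combine((1 - gamma) / (1 - beta), M1,
                   (gamma - beta) / (1 - beta), matmul(M1, M2))

def switching_ghm_direct(P1, P2, gamma, alpha):
    """Reference: GHM of the chain on (state, mode), marginalised over mode."""
    n = len(P1)
    A = [[0.0] * (2 * n) for _ in range(2 * n)]
    for s in range(n):
        for t in range(n):
            A[s][t] = (1 - alpha) * P1[s][t]   # keep following policy 1
            A[s][n + t] = alpha * P1[s][t]     # switch after this step
            A[n + s][n + t] = P2[s][t]         # policy 2 is never left
    M = ghm(A, gamma)
    # Rows 0..n-1 correspond to starting in mode "policy 1".
    return [[M[s][t] + M[s][n + t] for t in range(n)] for s in range(n)]

# Hypothetical 3-state example.
P1 = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]
P2 = [[0.2, 0.0, 0.8], [0.5, 0.5, 0.0], [0.0, 0.3, 0.7]]
gamma, alpha = 0.9, 0.25

composed = compose(P1, P2, gamma, alpha)
direct = switching_ghm_direct(P1, P2, gamma, alpha)
max_gap = max(abs(a - b) for ra, rb in zip(composed, direct) for a, b in zip(ra, rb))
print("largest entrywise gap:", max_gap)
```

If rewards depend only on the state reached, the value of the switching policy from each start state would follow by taking the expected reward under the corresponding row of `composed` and dividing by `1 - gamma`. A GPI step would then compare such values across a collection of switching policies and act greedily with respect to their maximum.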

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Balanced Selection