Generalised Policy Improvement with Geometric Policy Composition
Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Rémi Munos, André Barreto

TL;DR
This paper introduces a novel policy improvement method using geometric horizon models to interpolate between value-based and model-based RL, enabling effective policy composition and transfer.
Contribution
It presents a new approach for evaluating and composing non-Markov policies using GHMs, with theoretical analysis and empirical validation in deep RL tasks.
Findings
The method outperforms standard GPI in continuous control tasks.
The paper gives theoretical convergence guarantees for GHM training methods.
It describes stable training procedures for deep RL applications.
Abstract
We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its…
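The composition idea in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's algorithm, which learns GHMs with deep networks; here each GHM is an exact matrix, and every name (`ghm`, `switching_value`, `gpi_action`, the toy chain) is hypothetical. The sketch assumes that the reward depends only on the state arrived at, that the agent starts under the first policy and switches permanently to the second with probability `alpha` after each transition, and that a one-step model is available for the final greedy step. Under those assumptions the switching policy's value follows from a GHM of the first policy with the reduced discount `beta = gamma * (1 - alpha)` and a GHM of the second policy with discount `gamma`, with no further learning.

```python
# Tabular sketch (pure Python): evaluate a geometric switching policy by
# composing two GHMs, then pick a GPI action. All names are illustrative.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def ghm(P, discount, iters=2000):
    """Normalised discounted visitation G = (1-d) * sum_{t>=0} d^t P^(t+1),
    found by iterating the fixed point G = (1-d) P + d P G."""
    n = len(P)
    G = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        PG = matmul(P, G)
        G = [[(1 - discount) * P[i][j] + discount * PG[i][j] for j in range(n)]
             for i in range(n)]
    return G

def value_from_ghm(G, r, discount):
    """State values: expected reward under the GHM, rescaled by 1/(1-d)."""
    return [x / (1 - discount) for x in matvec(G, r)]

def switching_value(G1_beta, G2_gamma, r, gamma, alpha):
    """Value of 'follow policy 1, switch to policy 2 w.p. alpha per step'.
    Assumes G1_beta was built with discount beta = gamma * (1 - alpha)."""
    beta = gamma * (1 - alpha)
    v2 = value_from_ghm(G2_gamma, r, gamma)
    # On arrival at s: collect r[s]; with prob. alpha continue under policy 2.
    target = [r[s] + gamma * alpha * v2[s] for s in range(len(r))]
    return [x / (1 - beta) for x in matvec(G1_beta, target)]

def gpi_action(P_by_action, r, gamma, values, s):
    """GPI: argmax over actions of the max over candidate policies' Q(s, a)."""
    def q(a, v):
        return sum(p * (r[s2] + gamma * v[s2])
                   for s2, p in enumerate(P_by_action[a][s]))
    return max(range(len(P_by_action)),
               key=lambda a: max(q(a, v) for v in values))

# Toy 3-state chain; reward 1 on arriving in state 2.
P_by_action = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0: mostly stay
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1: move right
]
r = [0.0, 0.0, 1.0]
gamma, alpha = 0.9, 0.3

P1, P2 = P_by_action  # base policies: always action 0, always action 1
v1 = value_from_ghm(ghm(P1, gamma), r, gamma)
v2 = value_from_ghm(ghm(P2, gamma), r, gamma)
v_switch = switching_value(ghm(P1, gamma * (1 - alpha)), ghm(P2, gamma),
                           r, gamma, alpha)
a_star = gpi_action(P_by_action, r, gamma, [v1, v2, v_switch], s=1)
```

Setting `alpha = 0` recovers the value of the first policy alone, while larger `alpha` hands control to the second policy sooner. The GPI step then acts greedily with respect to the best of all candidate values at each state.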
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Balanced Selection
