Proper Value Equivalence
Christopher Grimm, André Barreto, Gregory Farquhar, David Silver, Satinder Singh

TL;DR
This paper introduces proper value equivalence (PVE) for model-based RL. It generalizes the value-equivalence (VE) principle to order $k$, proposes a loss function for learning PVE models, which are sufficient for optimal planning, and uses the resulting analysis to suggest practical improvements to MuZero.
Contribution
It generalizes the VE principle to order-$k$, defines PVE, and connects it to existing algorithms like MuZero, proposing modifications for better performance.
Findings
PVE models are sufficient for optimal planning despite ignoring many environment aspects.
A new loss function for learning PVE models is constructed.
Modified MuZero with PVE principles shows improved practical performance.
Abstract
One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increases, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step…
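The VE condition described in the abstract can be illustrated on a small tabular MDP. The following is a minimal sketch, not the authors' code: the function names, the three-state example, and the array layout are all hypothetical, and the order-$k$ check assumes that "order $k$" means applying each Bellman operator $k$ times before comparing results.

```python
import numpy as np

def bellman_operator(P, r, pi, v, gamma):
    """Apply the Bellman operator T_pi to v in a tabular MDP.

    Assumed layout: P is (A, S, S) transition probabilities,
    r is (S, A) expected rewards, pi is (S, A) action probabilities.
    """
    q = r + gamma * np.einsum("ast,t->sa", P, v)  # q[s, a]
    return (pi * q).sum(axis=1)

def is_value_equivalent(env, model, policies, functions, gamma, k=1, tol=1e-8):
    """Check that k applications of T_pi agree between env and model
    for every policy in `policies` and every function in `functions`."""
    for pi in policies:
        for v in functions:
            v_env, v_model = v, v
            for _ in range(k):
                v_env = bellman_operator(*env, pi, v_env, gamma)
                v_model = bellman_operator(*model, pi, v_model, gamma)
            if not np.allclose(v_env, v_model, atol=tol):
                return False
    return True

# Hypothetical example: 3 states, 1 action. The model gets the dynamics
# of state 0 wrong, but only in a way that v1 cannot detect.
gamma = 0.9
r = np.array([[1.0], [1.0], [0.0]])
P_env = np.array([[[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]])
P_model = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]])
env, model = (P_env, r), (P_model, r)

pi = np.ones((3, 1))             # the only policy when there is one action
v1 = np.array([1.0, 1.0, 0.0])   # treats states 0 and 1 identically
v2 = np.array([1.0, 0.0, 0.0])   # distinguishes state 0 from state 1

ve_small = is_value_equivalent(env, model, [pi], [v1], gamma)
ve_large = is_value_equivalent(env, model, [pi], [v1, v2], gamma)
ve_order2 = is_value_equivalent(env, model, [pi], [v1], gamma, k=2)
```

In this sketch the imperfect model is VE with respect to `{v1}` but stops being VE once `v2` is added, which mirrors the abstract's point that the set of VE models shrinks as the sets of policies and functions grow.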
Taxonomy
Topics: Reinforcement Learning in Robotics · Formal Methods in Verification · Machine Learning and Algorithms
Methods: Batch Normalization · Residual Connection · Prioritized Experience Replay · Residual Block · Convolution · Average Pooling · Monte-Carlo Tree Search · MuZero
