Learning Successor States and Goal-Dependent Values: A Mathematical Viewpoint

Léonard Blier, Corentin Tallec, Yann Ollivier

arXiv:2101.07123 · cs.LG · January 19, 2021

Open Access

TL;DR

This paper provides a mathematical framework for learning successor states and goal-dependent values in reinforcement learning, introducing new operators and estimators that improve convergence and robustness, especially in sparse reward settings.

Contribution

The paper derives novel TD algorithms for successor states and goal-dependent value functions, introduces the Bellman-Newton operator, and proposes a forward-backward parameterization that reduces variance.
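A TD algorithm for successor states can be illustrated in the simplest, tabular setting. The sketch below is ours, not the paper's code. It assumes the convention M(s, x) = Σ_{t≥0} γ^t P(s_{t+1} = x | s_0 = s), under which the Bellman equation reads M(s, ·) = P(s, ·) + γ E[M(s_{t+1}, ·)]. The helper names (`td_successor_update`, `learn_successor_states`, `goal_value`) are hypothetical.

```python
import random


def td_successor_update(M, s, s_next, alpha, gamma):
    """One tabular TD step on the successor-state table M (hypothetical helper).

    Assumed convention: M[s][x] = sum_{t>=0} gamma**t * P(s_{t+1} = x | s_0 = s),
    so the TD target for row s is indicator(s_next) + gamma * M[s_next].
    """
    # Copy the bootstrap row first: if s == s_next, updating M[s] in place
    # would otherwise change the target while it is being read.
    bootstrap = list(M[s_next])
    for x in range(len(M)):
        reached = 1.0 if x == s_next else 0.0
        target = reached + gamma * bootstrap[x]
        M[s][x] += alpha * (target - M[s][x])


def learn_successor_states(step_fn, n_states, n_steps, alpha=0.5, gamma=0.9, seed=0):
    """Run TD along a single trajectory produced by step_fn(s, rng) -> s_next."""
    rng = random.Random(seed)
    M = [[0.0] * n_states for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        s_next = step_fn(s, rng)
        td_successor_update(M, s, s_next, alpha, gamma)
        s = s_next
    return M


def goal_value(M, s, goal):
    """Goal-dependent value for the sparse reward 1[s_{t+1} == goal]."""
    return M[s][goal]


def next_in_cycle(s, rng):
    """Deterministic 3-state cycle 0 -> 1 -> 2 -> 0 (rng unused)."""
    return (s + 1) % 3


if __name__ == "__main__":
    M = learn_successor_states(next_in_cycle, n_states=3, n_steps=3000)
    print([round(v, 3) for v in M[0]])  # prints [2.989, 3.69, 3.321]
```

On this cycle the learned row M[0] approaches (γ², 1, γ)/(1 − γ³) and every row sums to 1/(1 − γ). Reading off M[s][g] gives the value of the sparse reward 1[s_{t+1} = g], which is one way to see the link between successor states and goal-dependent values in the discrete case.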

Findings

1. Finite-variance estimators for continuous environments.
2. Bellman-Newton operator improves convergence over TD.
3. Forward-backward parameterization reduces variance and models value functions.
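If the forward-backward parameterization is taken to be a low-rank form M(s, dx) ≈ F(s)ᵀB(x) ρ(dx), with d-dimensional feature maps F and B and a reference distribution ρ, then the value of any reward on the next state follows from a single d-dimensional vector. That form is an assumption here, and the paper's exact parameterization may differ. The tabular sketch below, with hypothetical helper names, compares this shortcut with the full matrix product V(s) = Σ_x M(s, x) r(x).

```python
def fb_successor_matrix(F, B, rho):
    """M[s][x] = F[s] . B[x] * rho[x] (assumed forward-backward form)."""
    return [[sum(f * b for f, b in zip(F_s, B_x)) * rho_x
             for B_x, rho_x in zip(B, rho)]
            for F_s in F]


def value_from_matrix(M, r):
    """V(s) = sum_x M(s, x) * r(x): discounted value of a reward on s_{t+1}."""
    return [sum(M_sx * r_x for M_sx, r_x in zip(row, r)) for row in M]


def fb_reward_embedding(B, rho, r):
    """z = sum_x rho(x) * r(x) * B(x), a d-dimensional summary of the reward."""
    d = len(B[0])
    z = [0.0] * d
    for B_x, rho_x, r_x in zip(B, rho, r):
        for k in range(d):
            z[k] += rho_x * r_x * B_x[k]
    return z


def fb_value(F, z):
    """V(s) = F(s) . z, computed without forming the state-by-state matrix."""
    return [sum(f * z_k for f, z_k in zip(F_s, z)) for F_s in F]


if __name__ == "__main__":
    F = [[1.0, 0.5], [0.2, 2.0], [0.0, 1.0]]   # forward features, d = 2
    B = [[1.0, 0.0], [0.3, 1.0], [2.0, -1.0]]  # backward features, d = 2
    rho = [0.2, 0.5, 0.3]
    r = [1.0, 0.0, 2.0]
    print(value_from_matrix(fb_successor_matrix(F, B, rho), r))
    print(fb_value(F, fb_reward_embedding(B, rho, r)))
```

Both lines print the same values up to rounding. The design point is that z costs one pass over the states, after which each new reward (including a one-hot goal reward) needs only a d-dimensional dot product per state.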

Abstract

In reinforcement learning, temporal-difference-based algorithms can be sample-inefficient: for instance, with sparse rewards, no learning occurs until a reward is observed. This can be remedied by learning richer objects, such as a model of the environment or successor states. Successor states model the expected future state occupancy from any given state for a given policy and are related to goal-dependent value functions, which learn how to reach arbitrary states. We formally derive the temporal difference algorithm for successor state and goal-dependent value function learning, for both discrete and continuous environments with function approximation. In particular, we provide finite-variance estimators even in continuous environments, where the reward for exactly reaching a goal state becomes infinitely sparse. Successor states satisfy more than just the Bellman equation: a…
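The abstract does not spell out the finite-variance estimator, so the following is only a tabular sketch under our own assumptions, not the paper's method. It assumes successor states are stored as a density m(s, x) with respect to a sampling distribution ρ, so that M(s, x) = m(s, x) ρ(x). The update then never evaluates the indicator of hitting an exact goal. Instead, the observed next state acts as a positive sample, and an independent draw x ~ ρ supplies a bootstrapped correction, so every sampled increment stays bounded. Helper names (`density_td_direction`, `expected_direction`, `density_td_step`) are hypothetical.

```python
import random


def density_td_direction(m, s, s_next, x, gamma):
    """Sampled update direction for a tabular density table m (hypothetical).

    Assumptions: M(s, x) = m[s][x] * rho[x], and x is drawn from rho
    independently of the transition (s, s_next). Returns {(row, col): increment}.
    """
    direction = {}
    # Positive term: the observed next state plays the role of the reached goal.
    direction[(s, s_next)] = direction.get((s, s_next), 0.0) + 1.0
    # Correction on an independent sample x ~ rho, bootstrapped from s_next.
    direction[(s, x)] = direction.get((s, x), 0.0) + gamma * m[s_next][x] - m[s][x]
    return direction


def expected_direction(m, s, s_next, rho, gamma):
    """Average the sampled direction exactly over x ~ rho (finite state space)."""
    total = {}
    for x, rho_x in enumerate(rho):
        for key, inc in density_td_direction(m, s, s_next, x, gamma).items():
            total[key] = total.get(key, 0.0) + rho_x * inc
    return total


def density_td_step(m, s, s_next, rho, gamma, alpha, rng):
    """Apply one stochastic step, drawing x from rho with the given rng."""
    x = rng.choices(range(len(rho)), weights=rho)[0]
    for (row, col), inc in density_td_direction(m, s, s_next, x, gamma).items():
        m[row][col] += alpha * inc
```

Averaged over x ~ ρ, the increment at entry (s, y) is 1[y = s_{t+1}] + ρ(y)(γ m(s_{t+1}, y) − m(s, y)). Its expectation over transitions vanishes exactly when M = m ρ satisfies the Bellman equation M(s, ·) = P(s, ·) + γ E[M(s_{t+1}, ·)]. With a continuous state space, the same structure would apply to a parametric m, with the tabular increments replaced by gradients.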

Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research