Learning Successor States and Goal-Dependent Values: A Mathematical Viewpoint
Léonard Blier, Corentin Tallec, Yann Ollivier

TL;DR
This paper provides a mathematical framework for learning successor states and goal-dependent values in reinforcement learning, introducing new operators and estimators that improve convergence and robustness, especially in sparse reward settings.
Contribution
It derives novel TD algorithms for successor states and goal-dependent values, introduces the Bellman-Newton operator, and proposes a forward-backward parameterization for better variance reduction.
Findings
Finite-variance estimators are provided for continuous environments, where the reward for exactly reaching a goal state becomes infinitely sparse.
Bellman-Newton operator improves convergence over TD.
Forward-backward parameterization reduces variance and models value functions.
Abstract
In reinforcement learning, temporal difference-based algorithms can be sample-inefficient: for instance, with sparse rewards, no learning occurs until a reward is observed. This can be remedied by learning richer objects, such as a model of the environment, or successor states. Successor states model the expected future state occupancy from any given state for a given policy and are related to goal-dependent value functions, which learn how to reach arbitrary states. We formally derive the temporal difference algorithm for successor state and goal-dependent value function learning, for both discrete and continuous environments with function approximation. In particular, we provide finite-variance estimators even in continuous environments, where the reward for exactly reaching a goal state becomes infinitely sparse. Successor states satisfy more than just the Bellman equation: a…
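To make the notion of successor states concrete, here is a minimal tabular sketch in Python. It is not the authors' algorithm: it assumes a tiny, hypothetical three-state deterministic cycle, a fixed policy, and the standard tabular TD update for the successor matrix M, whose entry M[s][g] is the expected discounted number of visits to state g when starting from s. The function names, the discount factor, and the learning rate are illustrative choices. The sketch also shows why this helps with sparse rewards: once M is learned, the value function for any reward vector is obtained as V = M r, without observing that reward during learning.

```python
def td_successor_matrix(transitions, n_states, gamma=0.5, lr=0.5, sweeps=200):
    """Tabular TD learning of the successor matrix M for a fixed policy.

    `transitions` is a list of observed (state, next_state) pairs.
    This is an illustrative sketch for the discrete case only.
    """
    M = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(sweeps):
        for s, s_next in transitions:
            for g in range(n_states):
                # TD target: one visit to s now, plus discounted future visits.
                target = (1.0 if g == s else 0.0) + gamma * M[s_next][g]
                M[s][g] += lr * (target - M[s][g])
    return M


def values_from_successors(M, reward):
    """Recover the value function V = M r for a state-based reward vector r."""
    return [sum(m_sg * r_g for m_sg, r_g in zip(row, reward)) for row in M]


# Hypothetical environment: a deterministic cycle 0 -> 1 -> 2 -> 0.
transitions = [(0, 1), (1, 2), (2, 0)]
M = td_successor_matrix(transitions, n_states=3)

# Sparse reward: only state 2 is rewarded; M was learned without seeing it.
V = values_from_successors(M, reward=[0.0, 0.0, 1.0])
```

For this cycle the exact successor matrix is (I - γP)^{-1}, so with γ = 0.5 the learned row for state 0 should approach (8/7, 4/7, 2/7). In the continuous setting discussed in the abstract, the indicator term in the target is no longer usable directly, which is the setting in which the paper provides finite-variance estimators.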
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research
