Chaining Value Functions for Off-Policy Learning
Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

TL;DR
This paper introduces a new family of stable off-policy prediction algorithms that build a chain of value functions, converging to the off-policy solution and improving stability in reinforcement learning.
Contribution
It proposes a novel chaining method for off-policy learning that is convergent by construction and approximates the off-policy TD solution, even in cases where standard off-policy TD diverges.
Findings
The algorithm is stable and converges under mild conditions.
Approximates off-policy TD solutions in challenging scenarios.
Empirical results show improved stability on Baird's counterexample.
Abstract
To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn 'off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated out of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain.…
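The chaining idea in the abstract can be illustrated with a minimal tabular sketch. It is not the paper's algorithm: it assumes a made-up 3-state, 2-action MDP with known dynamics and uses exact expected backups instead of sampled TD updates or function approximation. All names (`P`, `R`, `MU`, `PI`, `backup`, `evaluate`, `build_chain`) are hypothetical. The first link is the on-policy value of the behaviour policy; each later link applies one backup under the target policy, bootstrapping on the previous link.

```python
# Tabular sketch of a value-function chain on a hypothetical toy MDP.
# Assumption: dynamics are known, so every backup is an exact expectation.
GAMMA = 0.9
N_STATES = 3
ACTIONS = (0, 1)

# P[a][s][s2]: probability of moving from s to s2 under action a (made up).
P = {
    0: [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]],
    1: [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]],
}
# R[a][s]: expected immediate reward for taking action a in state s (made up).
R = {0: [0.0, 0.0, 1.0], 1: [0.1, 0.0, 0.5]}

MU = [0.8, 0.2]  # behaviour policy: state-independent action probabilities
PI = [0.1, 0.9]  # target policy: state-independent action probabilities


def backup(policy, v):
    """One expected Bellman backup under `policy`, bootstrapping on `v`."""
    return [
        sum(
            policy[a]
            * (R[a][s] + GAMMA * sum(P[a][s][s2] * v[s2] for s2 in range(N_STATES)))
            for a in ACTIONS
        )
        for s in range(N_STATES)
    ]


def evaluate(policy, sweeps=2000):
    """Policy evaluation by repeating the backup until it has converged."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = backup(policy, v)
    return v


def build_chain(length):
    """chain[0] is the on-policy value of MU; chain[k+1] bootstraps on chain[k]."""
    chain = [evaluate(MU)]
    for _ in range(length):
        chain.append(backup(PI, chain[-1]))
    return chain


def max_abs_diff(u, v):
    """Largest absolute difference between two value vectors."""
    return max(abs(x - y) for x, y in zip(u, v))


if __name__ == "__main__":
    chain = build_chain(50)
    v_pi = evaluate(PI)
    for k in (0, 1, 5, 20, 50):
        print(k, max_abs_diff(chain[k], v_pi))
```

In this simplified setting each link is a fixed target computed from the previous one, so nothing in the chain bootstraps on itself. Because the target-policy backup is a contraction with factor `GAMMA` in the max norm, the printed gap to the target policy's value shrinks as the chain gets longer.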
Taxonomy
Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Smart Grid Energy Management
