# On Generalized Bellman Equations and Temporal-Difference Learning

**Authors:** Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton

arXiv: 1704.04463 · 2018-11-27

## TL;DR

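This paper introduces a flexible scheme for setting the $\lambda$-parameters in off-policy TD learning, based on generalized Bellman equations. The scheme keeps the eligibility traces bounded to curb variance, allows larger $\lambda$ values than prior work, and comes with a proof of ergodicity of the joint state-trace process and a characterization of convergence for least-squares implementations.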

## Contribution

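It proposes a more direct and flexible method for setting $\lambda$ in off-policy TD learning: $\lambda$ is chosen as a function of the eligibility trace iterates computed by TD, which keeps the traces in a desired bounded range and permits much larger $\lambda$ values than prior bounded-trace schemes.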

## Key findings

- The scheme maintains bounded traces in off-policy TD learning.
- It establishes ergodicity of the joint state-trace process.
- It characterizes the convergence behavior of least-squares implementations.

## Abstract

We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the $\lambda$-parameters of TD, based on generalized Bellman equations. Our scheme is to set $\lambda$ according to the eligibility trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared with prior work, this scheme is more direct and flexible, and allows much larger $\lambda$ values for off-policy TD learning with bounded traces. As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process. These results not only lead immediately to a characterization of the convergence behavior of least-squares based implementation of our scheme, but also prepare the ground for further analysis of gradient-based implementations.
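The idea of choosing $\lambda$ from the trace iterates can be illustrated with a minimal sketch. It assumes a trace recursion of the common form $e_t = \lambda_t \gamma \rho_{t-1} e_{t-1} + \phi(S_t)$ and one simple bounding rule: take $\lambda_t = 1$ when the scaled previous trace already has norm at most $C$, and otherwise shrink it onto the ball of radius $C$. The summary above does not spell out the exact rule, so this rule, the function names (`choose_lambda`, `update_trace`, `simulate`), the random features, and the set of importance-sampling ratios are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def choose_lambda(e_prev, gamma, rho_prev, bound):
    """Return the largest lambda in [0, 1] such that
    ||lambda * gamma * rho_prev * e_prev|| <= bound (assumed rule)."""
    norm = math.sqrt(sum(x * x for x in e_prev))
    scale = gamma * rho_prev * norm
    return 1.0 if scale <= bound else bound / scale


def update_trace(e_prev, phi, gamma, rho_prev, bound):
    """One trace update: e = lambda * gamma * rho_prev * e_prev + phi."""
    lam = choose_lambda(e_prev, gamma, rho_prev, bound)
    coef = lam * gamma * rho_prev
    e = [coef * x + f for x, f in zip(e_prev, phi)]
    return e, lam


def simulate(steps=1000, bound=5.0, gamma=0.95, seed=0):
    """Run the trace recursion on synthetic data and record the largest
    trace norm seen, together with every lambda that was chosen."""
    rng = random.Random(seed)
    e = [0.0, 0.0]
    max_norm = 0.0
    lambdas = []
    for _ in range(steps):
        # Hypothetical importance-sampling ratios; 3.0 would make the
        # traces grow quickly if lambda were fixed at 1.
        rho_prev = rng.choice([0.0, 0.5, 3.0])
        # Hypothetical 2-d features with each entry in [-1, 1].
        phi = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        e, lam = update_trace(e, phi, gamma, rho_prev, bound)
        lambdas.append(lam)
        max_norm = max(max_norm, math.sqrt(sum(x * x for x in e)))
    return max_norm, lambdas


if __name__ == "__main__":
    max_norm, lambdas = simulate()
    print("largest trace norm:", max_norm)
    print("smallest lambda used:", min(lambdas))
```

Under this rule the trace norm can never exceed $C$ plus the largest feature norm, regardless of how large the importance-sampling ratios are, while $\lambda_t$ stays at 1 whenever the trace is already small. That is the sense in which a trace-dependent $\lambda$ can remain large without letting the traces blow up.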

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.04463/full.md

## Figures

49 figures with captions in the complete paper: https://tomesphere.com/paper/1704.04463/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/1704.04463/full.md

---
Source: https://tomesphere.com/paper/1704.04463