Emphatic TD Bellman Operator is a Contraction
Assaf Hallak, Aviv Tamar, Shie Mannor

TL;DR
This paper proves that the emphatic TD Bellman operator is a contraction, enabling the derivation of the first error bounds for off-policy evaluation algorithms under general policies.
Contribution
It shows that the projected Bellman operator underlying ETD is a contraction with modulus $\sqrt{\gamma}$, where $\gamma$ is the discount factor, which yields new error bounds for off-policy evaluation.
Findings
The emphatic TD Bellman operator is a $\sqrt{\gamma}$-contraction.
Provides the first error bounds for off-policy evaluation with general policies.
Establishes theoretical guarantees for ETD's approximation accuracy.
Abstract
Recently, Sutton et al. (2015) introduced the emphatic temporal differences (ETD) algorithm for off-policy evaluation in Markov decision processes. In this short note, we show that the projected fixed-point equation that underlies ETD involves a contraction operator, with a $\sqrt{\gamma}$-contraction modulus (where $\gamma$ is the discount factor). This allows us to provide error bounds on the approximation error of ETD. To our knowledge, these are the first error bounds for an off-policy evaluation algorithm under general target and behavior policies.
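To make the link between the contraction property and the error bound concrete, here is a sketch of the standard projected fixed-point argument. The notation is assumed for illustration and is not taken from the abstract: $T$ is an operator with fixed point $V$ that is a $\beta$-contraction in a weighted Euclidean norm $\|\cdot\|_f$, $\Pi$ is the orthogonal projection onto the approximation subspace with respect to that same norm, and $\hat{V}$ solves $\hat{V} = \Pi T \hat{V}$. The exact statement and constants in the paper may differ.

```latex
% Assumed setup: T V = V, \hat{V} = \Pi T \hat{V},
% \|T u - T v\|_f \le \beta \|u - v\|_f, and \Pi is the
% orthogonal (hence non-expansive) projection in \|\cdot\|_f.
\begin{align*}
\|\hat{V} - V\|_f^2
  &= \|\hat{V} - \Pi V\|_f^2 + \|\Pi V - V\|_f^2
     && \text{(Pythagorean theorem)} \\
  &= \|\Pi T \hat{V} - \Pi T V\|_f^2 + \|\Pi V - V\|_f^2
     && \text{(fixed-point equations)} \\
  &\le \beta^2 \|\hat{V} - V\|_f^2 + \|\Pi V - V\|_f^2
     && \text{(non-expansion, contraction)}
\end{align*}
% Rearranging, and substituting \beta = \sqrt{\gamma}:
\[
\|\hat{V} - V\|_f
  \le \frac{1}{\sqrt{1 - \beta^2}} \, \|\Pi V - V\|_f
  = \frac{1}{\sqrt{1 - \gamma}} \, \|\Pi V - V\|_f .
\]
```

Under these assumptions, the error of the projected fixed point is within a factor $1/\sqrt{1-\gamma}$ of the best approximation error achievable in the subspace.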
Taxonomy
Topics: Reinforcement Learning in Robotics · Neural Networks and Applications · Advanced Control Systems Optimization
