The Optimal Unbiased Value Estimator and its Relation to LSTD, TD and MC
Steffen Grünewälder, Klaus Obermayer

TL;DR
This paper analytically derives the optimal unbiased value estimator (the minimum-variance unbiased estimator, MVU), compares it with TD, MC, and LSTD, and characterizes their biases and mutual relations in acyclic and cyclic Markov Reward Processes (MRPs).
Contribution
It provides a theoretical analysis of the MVU, establishes its relation to LSTD, TD, and MC, and clarifies conditions for unbiasedness and estimator risk ordering.
Findings
LSTD is equivalent to MVU in acyclic MRPs.
MC is equivalent to the MVU and to LSTD in undiscounted MRPs when MC has the same amount of information available.
TD is unbiased in acyclic MRPs and biased in cyclic MRPs.
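The findings above can be illustrated with a small numerical sketch. It is a minimal tabular example, not the paper's construction: the `TERMINAL` marker, the helper names `mc_estimate` and `lstd_estimate`, and the hand-made episodes are all assumptions introduced here. It uses first-visit MC and LSTD with one-hot features on an acyclic, undiscounted MRP in which every episode starts in state 0.

```python
import numpy as np

TERMINAL = -1  # hypothetical marker for the end of an episode


def mc_estimate(episodes, n_states, gamma=1.0):
    """First-visit Monte Carlo: average the return following the first visit to each state.

    Assumes every state is visited at least once.
    """
    sums = np.zeros(n_states)
    counts = np.zeros(n_states)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backwards so the value stored last belongs to the earliest visit.
        for state, reward, _next_state in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g
        for state, ret in first_visit_return.items():
            sums[state] += ret
            counts[state] += 1
    return sums / counts


def lstd_estimate(episodes, n_states, gamma=1.0):
    """Tabular LSTD: solve A v = b with one-hot state features.

    A accumulates phi(s) (phi(s) - gamma * phi(s'))^T and b accumulates phi(s) * r.
    Assumes every state is visited at least once so that A is invertible.
    """
    A = np.zeros((n_states, n_states))
    b = np.zeros(n_states)
    for episode in episodes:
        for state, reward, next_state in episode:
            A[state, state] += 1.0
            if next_state != TERMINAL:
                A[state, next_state] -= gamma
            b[state] += reward
    return np.linalg.solve(A, b)


# Each transition is (state, reward, next_state); all episodes start in state 0.
same_info = [
    [(0, 1.0, 1), (1, 2.0, TERMINAL)],
    [(0, 0.0, 2), (2, 4.0, TERMINAL)],
    [(0, 1.0, 1), (1, 0.0, TERMINAL)],
]
v_mc = mc_estimate(same_info, n_states=3)
v_lstd = lstd_estimate(same_info, n_states=3)

# One extra episode that starts in state 1: MC cannot use it for state 0, LSTD can.
extra_info = same_info + [[(1, 5.0, TERMINAL)]]
v_mc_extra = mc_estimate(extra_info, n_states=3)
v_lstd_extra = lstd_estimate(extra_info, n_states=3)
```

On `same_info` the two estimates coincide. On `extra_info` they differ for state 0, which is one way to read the "same amount of information" condition: LSTD propagates the extra sample for state 1 back to state 0 through the Bellman equation, while MC only averages returns of paths that pass through state 0.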
Abstract
In this analytical study we derive the optimal unbiased value estimator (MVU) and compare its statistical risk to three well-known value estimators: Temporal Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares Temporal Difference Learning (LSTD). We demonstrate that LSTD is equivalent to the MVU if the Markov Reward Process (MRP) is acyclic and show that both differ for most cyclic MRPs as LSTD is then typically biased. More generally, we show that estimators that fulfill the Bellman equation can only be unbiased for special cyclic MRPs. The main reason is the probability measures with which the expectations are taken: these measures vary from state to state, and due to the strong coupling by the Bellman equation it is typically not possible for a set of value estimators to be unbiased with respect to each of these measures. Furthermore, we derive relations of the…
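The coupling argument in the abstract can be sketched in equations. The notation below is assumed for illustration and is not taken from the paper: $\mathbb{E}_s$ denotes the expectation under the measure associated with state $s$, and $\hat r$, $\hat p$, $\hat V$ are the estimated reward, transition probabilities, and values.

```latex
% Bellman equation of an MRP with discount factor \gamma
V(s) = r(s) + \gamma \sum_{s'} p(s' \mid s)\, V(s')

% An estimator that fulfills the Bellman equation on the estimated model
\hat V(s) = \hat r(s) + \gamma \sum_{s'} \hat p(s' \mid s)\, \hat V(s')

% Unbiasedness is required state by state, each under its own measure
\mathbb{E}_s\big[\hat V(s)\big] = V(s) \quad \text{for every state } s

% Taking \mathbb{E}_s of the estimator equation
\mathbb{E}_s\big[\hat V(s)\big]
  = \mathbb{E}_s\big[\hat r(s)\big]
  + \gamma \sum_{s'} \mathbb{E}_s\big[\hat p(s' \mid s)\, \hat V(s')\big]
```

The last line involves $\hat V(s')$ under $\mathbb{E}_s$ rather than under $\mathbb{E}_{s'}$, and as a product with the random quantity $\hat p(s' \mid s)$. Under this reading, unbiasedness of $\hat V(s')$ with respect to its own measure does not by itself make $\hat V(s)$ unbiased, which is consistent with the abstract's statement that simultaneous unbiasedness is typically not attainable in cyclic MRPs.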
Taxonomy
Topics: Advanced Control Systems Optimization · Control Systems and Identification · Advanced Multi-Objective Optimization Algorithms
