Policy evaluation from a single path: Multi-step methods, mixing and mis-specification
Yaqi Duan, Martin J. Wainwright

TL;DR
This paper provides non-asymptotic guarantees for kernel-based multi-step temporal difference methods in estimating value functions from a single Markov process trajectory, highlighting effects of model mis-specification and mixing time.
Contribution
It introduces a unified analysis of multi-step TD methods with non-asymptotic bounds, including mis-specification effects and optimality results, for data from a single trajectory.
Findings
Bounds depend on Bellman fluctuations and mixing time.
Mis-specification inflates the statistical error, an effect that longer look-ahead mitigates.
Minimax lower bounds show optimality of proposed methods.
Abstract
We study non-parametric estimation of the value function of an infinite-horizon γ-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical K-step look-ahead TD for K = 1, 2, … and the TD(λ) family for λ ∈ [0, 1) as special cases. Our bounds capture the dependence on Bellman fluctuations, mixing time of the Markov chain, any mis-specification in the model, as well as the choice of weight function defining the estimator itself, and reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, its statistical error under trajectory data is similar to that of i.i.d. sample transition pairs, whereas under…
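To make the "multi-step" and "weight function" terminology concrete, here is a minimal sketch of the regression targets that multi-step TD methods build from a single trajectory. It uses the standard textbook definitions: the K-step target sums K discounted rewards and then bootstraps from the current value estimate, and TD(λ) places geometric weight (1 − λ)λ^(k−1) on the k-step target. It does not implement the paper's kernel-based estimator. The function names (`k_step_targets`, `td_lambda_weights`, `mixed_targets`) are hypothetical, and renormalising the truncated weights is an assumption made here for illustration.

```python
from typing import List


def k_step_targets(rewards: List[float], values: List[float],
                   gamma: float, K: int) -> List[float]:
    """K-step look-ahead targets along one trajectory.

    rewards[t] is the reward observed at step t and values[t] is the
    current value estimate at the state visited at step t.
    target_t = sum_{j<K} gamma^j * rewards[t+j] + gamma^K * values[t+K],
    defined for t = 0, ..., n - K - 1.
    """
    if K < 1:
        raise ValueError("K must be at least 1")
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have equal length")
    n = len(rewards)
    targets = []
    for t in range(n - K):
        discounted = sum(gamma ** j * rewards[t + j] for j in range(K))
        targets.append(discounted + gamma ** K * values[t + K])
    return targets


def td_lambda_weights(lam: float, max_k: int) -> List[float]:
    """Geometric TD(lambda) weights on the 1..max_k step targets (truncated)."""
    return [(1.0 - lam) * lam ** (k - 1) for k in range(1, max_k + 1)]


def mixed_targets(rewards: List[float], values: List[float],
                  gamma: float, weights: List[float]) -> List[float]:
    """Weighted mixture of k-step targets, with weights[k-1] on the k-step one.

    Assumption for this sketch: truncated weights are renormalised to sum to 1.
    """
    max_k = len(weights)
    total = sum(weights)
    per_k = [k_step_targets(rewards, values, gamma, k)
             for k in range(1, max_k + 1)]
    n = len(rewards)
    return [
        sum(w * per_k[i][t] for i, w in enumerate(weights)) / total
        for t in range(n - max_k)
    ]


if __name__ == "__main__":
    # Toy trajectory: constant reward 1 with gamma = 0.5, so the true value is 2.
    rewards = [1.0] * 6
    values = [2.0] * 6
    print(k_step_targets(rewards, values, gamma=0.5, K=2))
    print(mixed_targets(rewards, values, 0.5, td_lambda_weights(0.5, 3)))
```

On the toy trajectory, the value estimates already satisfy the Bellman equation, so every target reproduces them regardless of K or the weights. With an inexact value estimate, a larger K shrinks the bootstrap term by the factor γ^K.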
Taxonomy
Topics: Statistical Methods and Inference · Markov Chains and Monte Carlo Methods · Statistical Methods and Bayesian Inference
