Policy Evaluation in Continuous MDPs with Efficient Kernelized Gradient Temporal Difference
Alec Koppel, Garrett Warnell, Ethan Stump, Peter Stone, Alejandro Ribeiro

TL;DR
This paper introduces a memory-efficient, non-parametric stochastic method for policy evaluation in continuous MDPs, leveraging kernelized gradient TD learning to achieve faster convergence with less memory.
Contribution
It extends gradient temporal difference learning to a non-parametric, kernel-based setting with guaranteed convergence and improved efficiency in continuous state spaces.
Findings
Faster convergence to lower Bellman error in the Mountain Car domain
Achieves convergence with finite memory and complexity
Outperforms existing methods in efficiency and accuracy
Abstract
We consider policy evaluation in infinite-horizon discounted Markov decision problems (MDPs) with infinite spaces. We reformulate this task as a compositional stochastic program with a function-valued decision variable that belongs to a reproducing kernel Hilbert space (RKHS). We approach this problem via a new functional generalization of stochastic quasi-gradient methods operating in tandem with stochastic sparse subspace projections. The result is an extension of gradient temporal difference learning that yields nonlinearly parameterized value function estimates of the solution to the Bellman evaluation equation. Our main contribution is a memory-efficient non-parametric stochastic method guaranteed to converge exactly to the Bellman fixed point with probability 1 with attenuating step-sizes. Further, with constant step-sizes, we obtain mean convergence to a neighborhood and that the…
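The combination described in the abstract, a stochastic quasi-gradient step on a kernel expansion followed by a step that keeps the expansion small, can be illustrated with a rough Python sketch. This is not the authors' algorithm. The toy one-dimensional random-walk MDP, the Gaussian kernel, the step sizes, and every function name below are assumptions made for illustration. In particular, the sparse subspace projection is replaced here by a much cruder rule that merges a new kernel center into an existing one when the two lie within a tolerance tol.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=0.2):
    """Gaussian kernel matrix between scalar-state arrays a (n,) and b (m,)."""
    a = np.atleast_1d(a).astype(float)
    b = np.atleast_1d(b).astype(float)
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * bandwidth ** 2))

def value(x, centers, weights):
    """Evaluate V(x) = sum_i w_i * kappa(c_i, x); zero for an empty dictionary."""
    if len(centers) == 0:
        return 0.0
    return float(gaussian_kernel(x, np.array(centers))[0] @ np.array(weights))

def add_atom(centers, weights, x, w, tol):
    """Add weight w at state x. ASSUMPTION: a simple merge rule stands in
    for the sparse subspace projection mentioned in the abstract."""
    if centers:
        dists = np.abs(np.array(centers) - x)
        j = int(np.argmin(dists))
        if dists[j] <= tol:
            weights[j] += w  # reuse the nearest existing center
            return
    centers.append(float(x))
    weights.append(float(w))

def kernel_gtd_sketch(steps=2000, gamma=0.9, alpha=0.05, beta=0.5,
                      lam=1e-3, tol=0.05, seed=0):
    """Hypothetical kernelized gradient-TD loop on a toy MDP (not from the paper)."""
    rng = np.random.default_rng(seed)
    centers, weights = [], []
    z = 0.0                      # running average of the TD error
    x = float(rng.uniform())
    for _ in range(steps):
        # Toy dynamics (assumption): clipped Gaussian random walk on [0, 1].
        x_next = float(np.clip(x + rng.normal(0.0, 0.1), 0.0, 1.0))
        r = x                    # toy reward (assumption)
        delta = r + gamma * value(x_next, centers, weights) - value(x, centers, weights)
        z = (1.0 - beta) * z + beta * delta
        # Regularization shrinks all existing weights.
        weights[:] = [(1.0 - alpha * lam) * w for w in weights]
        # Functional step: V <- V - alpha * z * (gamma * kappa(x', .) - kappa(x, .)).
        add_atom(centers, weights, x, alpha * z, tol)
        add_atom(centers, weights, x_next, -alpha * gamma * z, tol)
        x = x_next
    return centers, weights

centers, weights = kernel_gtd_sketch()
print(len(centers), value(0.5, centers, weights))
```

Because a new center is stored only when it lies more than tol away from every stored center, the dictionary in this sketch cannot grow beyond roughly 1/tol entries on the unit interval. That is the kind of finite-memory behavior the abstract refers to, obtained here by a far simpler mechanism than the one the paper describes.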
