Stochastic Policy Gradient Ascent in Reproducing Kernel Hilbert Spaces
Santiago Paternain, Juan Andrés Bazerque, Austin Small, and Alejandro Ribeiro

TL;DR
This paper introduces a stochastic policy gradient ascent algorithm for reinforcement learning with policies in a reproducing kernel Hilbert space (RKHS). Its gradient estimates are unbiased and variance-reduced, and sparse RKHS representations keep the learned policies low in complexity.
Contribution
The paper presents a new policy gradient method with unbiased gradient estimates, variance reduction, and sparse RKHS representations. Unbiasedness is what makes the convergence proof possible, and sparsity keeps the policy complexity under control in practice.
Findings
Successful learning of low-complexity policies
Convergence to a stationary point of the expected cumulative reward proven
Numerical examples confirm effectiveness
Abstract
Reinforcement learning consists of finding policies that maximize an expected cumulative long-term reward in a Markov decision process with unknown transition probabilities and instantaneous rewards. In this paper, we consider the problem of finding such optimal policies while assuming they are continuous functions belonging to a reproducing kernel Hilbert space (RKHS). To learn the optimal policy we introduce a stochastic policy gradient ascent algorithm with three novel features: (i) The stochastic estimates of policy gradients are unbiased. (ii) The variance of stochastic gradients is reduced by drawing on ideas from numerical differentiation. (iii) Policy complexity is controlled using sparse RKHS representations. Novel feature (i) is instrumental in proving convergence to a stationary point of the expected cumulative reward. Novel feature (ii) facilitates reasonable…
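The sketch below illustrates, under stated assumptions, two of the ingredients named in the abstract. It is not the authors' algorithm. It assumes a one-dimensional state and action, a Gaussian policy a ~ N(h(s), σ²) whose mean h lives in an RKHS with a Gaussian kernel, and an unbiased Q estimate obtained by summing undiscounted rewards over a random horizon T with P(T ≥ t) = γ^t. This is a standard trick for unbiased discounted-return estimates, and the paper's exact estimator may differ. The functional gradient step adds one kernel center per sample. Magnitude pruning (`prune`) is a crude stand-in for the paper's sparse RKHS representation, and the numerical-differentiation variance reduction is not shown. All names (`KernelPolicy`, `geometric_horizon_return`, the toy reward and dynamics) are hypothetical.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=0.5):
    """Gaussian (RBF) kernel between scalar states x and y (assumed kernel choice)."""
    return np.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


class KernelPolicy:
    """Hypothetical Gaussian policy a ~ N(h(s), sigma^2) with h in an RKHS.

    h(s) = sum_i w_i k(c_i, s) is stored as kernel centers c_i and weights w_i.
    """

    def __init__(self, sigma=0.5, bandwidth=0.5):
        self.sigma = sigma
        self.bandwidth = bandwidth
        self.centers = []
        self.weights = []

    def mean(self, s):
        return sum(w * gaussian_kernel(c, s, self.bandwidth)
                   for c, w in zip(self.centers, self.weights))

    def sample(self, s, rng):
        return self.mean(s) + self.sigma * rng.standard_normal()

    def gradient_step(self, s, a, q_hat, step_size):
        # The functional gradient of log N(a; h(s), sigma^2) w.r.t. h is
        # (a - h(s)) / sigma^2 * k(s, .), so one ascent step adds a center at s.
        coeff = step_size * q_hat * (a - self.mean(s)) / self.sigma ** 2
        self.centers.append(s)
        self.weights.append(coeff)

    def prune(self, tol):
        # Stand-in for sparsification (assumption): drop centers with |w| < tol.
        keep = [i for i, w in enumerate(self.weights) if abs(w) >= tol]
        self.centers = [self.centers[i] for i in keep]
        self.weights = [self.weights[i] for i in keep]


def geometric_horizon_return(reward_fn, step_fn, s0, policy, gamma, rng):
    """Return (a0, q_hat), an unbiased estimate of Q(s0, a0).

    T is drawn with P(T >= t) = gamma**t and independently of the trajectory,
    so E[sum_{t=0}^{T} r_t] = sum_t gamma**t E[r_t], the discounted return.
    """
    horizon = rng.geometric(1.0 - gamma) - 1  # T takes values in {0, 1, 2, ...}
    s = s0
    a = a0 = policy.sample(s, rng)
    total = 0.0
    for _ in range(horizon + 1):
        total += reward_fn(s, a)
        s = step_fn(s, a, rng)
        a = policy.sample(s, rng)
    return a0, total


def toy_reward(s, a):
    # Hypothetical reward: stay close to the origin.
    return -s ** 2


def toy_step(s, a, rng):
    # Hypothetical 1-D dynamics with small Gaussian noise, clipped to [-2, 2].
    return float(np.clip(s + 0.1 * a + 0.05 * rng.standard_normal(), -2.0, 2.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    policy = KernelPolicy()
    for _ in range(100):
        s0 = rng.uniform(-2.0, 2.0)
        a0, q_hat = geometric_horizon_return(toy_reward, toy_step, s0, policy, 0.9, rng)
        policy.gradient_step(s0, a0, q_hat, step_size=0.01)
        policy.prune(tol=1e-3)
    print(len(policy.centers), "kernel centers kept")
```

Drawing the horizon at random, rather than truncating at a fixed length, is what keeps the return estimate unbiased. The price is extra variance, which is the kind of issue the abstract's feature (ii) is meant to address.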
