A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP
Tejaram Sangadi, L. A. Prashanth, Krishna Jagannathan

TL;DR
This paper provides finite-sample theoretical guarantees for a risk-sensitive actor-critic algorithm in reinforcement learning, analyzing convergence rates and bounds for mean-variance optimization in discounted MDPs.
Contribution
The paper derives finite-sample bounds for a TD learning algorithm with linear function approximation and pairs this critic with SPSA-based actor updates, advancing the understanding of risk-sensitive reinforcement learning methods.
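To make the SPSA idea concrete, the following is a minimal sketch of a simultaneous-perturbation gradient estimate. It assumes a two-sided estimate with Rademacher (plus or minus one) perturbations, which is a common textbook form; the paper's actor update may differ in its details. The names `spsa_gradient`, `objective`, and `delta` are hypothetical and chosen only for illustration.

```python
import numpy as np


def spsa_gradient(objective, theta, delta, rng):
    """Two-sided SPSA gradient estimate (illustrative, not the paper's exact update).

    objective: callable mapping a parameter vector to a scalar
    theta:     current parameter vector
    delta:     perturbation size (assumed to be a small positive constant)
    rng:       numpy random Generator
    """
    # Rademacher perturbation: every coordinate is +1 or -1 with equal probability.
    perturbation = rng.choice([-1.0, 1.0], size=theta.shape)
    value_plus = objective(theta + delta * perturbation)
    value_minus = objective(theta - delta * perturbation)
    # Only two evaluations are needed regardless of the dimension of theta.
    # Dividing by a +/-1 entry is the same as multiplying by it.
    return (value_plus - value_minus) / (2.0 * delta) * perturbation


# Example: one noisy gradient estimate of a simple quadratic.
rng = np.random.default_rng(0)
theta0 = np.array([1.0, -1.0, 0.5])
grad_estimate = spsa_gradient(lambda th: float(np.sum(th ** 2)), theta0, 0.05, rng)
```

A single estimate is noisy, but it needs only two evaluations of the objective whatever the parameter dimension, which is the usual reason for using SPSA when each evaluation requires running a critic.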
Findings
Finite-sample bounds whose dependence on the initial error decays exponentially.
Convergence rate of O(1/t) after t iterations for the TD algorithm.
O(n^{-1/4}) convergence guarantee for the actor-critic method.
Abstract
Motivated by applications in risk-sensitive reinforcement learning, we study mean-variance optimization in a discounted reward Markov Decision Process (MDP). Specifically, we analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation. We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging, both with and without regularization. Our bounds exhibit an exponentially decaying dependence on the initial error and a convergence rate of O(1/t) after t iterations. Moreover, for the regularized TD variant, our bound holds for a universal step size. Next, we integrate a Simultaneous Perturbation Stochastic Approximation (SPSA)-based actor update with an LFA critic and establish an O(n^{-1/4}) convergence guarantee, where n denotes the iterations of the…
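As a rough illustration of the policy-evaluation setting the abstract describes, here is a minimal sketch of TD(0) with linear function approximation and tail iterate averaging on a tiny synthetic Markov reward process. It covers only the plain value function and does not attempt the variance-related part of the paper's critic or its regularized variant. The function name `td0_tail_average`, the two-state chain, the constant step size, and the choice to average the last half of the iterates are all assumptions made for this example.

```python
import numpy as np


def td0_tail_average(P, r, Phi, gamma, step_size, num_iters, tail_start, seed=0):
    """TD(0) with linear features, returning the average of the tail iterates.

    P:   (S, S) transition matrix under the fixed policy being evaluated
    r:   (S,) expected reward per state
    Phi: (S, d) feature matrix, one row of features per state
    """
    rng = np.random.default_rng(seed)
    num_states, dim = Phi.shape
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    state = 0
    for t in range(num_iters):
        next_state = rng.choice(num_states, p=P[state])
        # TD error for the linear value estimate Phi @ w.
        td_error = r[state] + gamma * (Phi[next_state] @ w) - Phi[state] @ w
        w = w + step_size * td_error * Phi[state]
        # Tail averaging: only iterates from tail_start onward are averaged.
        if t >= tail_start:
            w_sum += w
        state = next_state
    return w_sum / (num_iters - tail_start)


# Hypothetical two-state chain with one-hot (tabular) features.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
r = np.array([1.0, 0.0])
Phi = np.eye(2)
w_bar = td0_tail_average(P, r, Phi, gamma=0.9, step_size=0.05,
                         num_iters=20000, tail_start=10000)
# Exact value function for comparison: V = (I - gamma * P)^{-1} r.
v_true = np.linalg.solve(np.eye(2) - 0.9 * P, r)
```

Discarding the early iterates is what lets the influence of the initial point fade quickly, while averaging the remaining ones damps the sampling noise left by a constant step size.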
Taxonomy
Topics: Cancer-related molecular mechanisms research
Methods: Exponential Decay
