Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
Gugan Thoppe, L. A. Prashanth, Ankur Naskar, and Sanjay Bhat

TL;DR
This paper develops and analyzes new reinforcement learning algorithms for exponential utility optimization in discounted MDPs, establishing their convergence and optimality properties.
Contribution
It introduces two novel Q-value-style algorithms with convergence guarantees for exponential utility in discounted MDPs, filling a key gap in principled value-based RL methods.
Findings
Proved contraction properties of the operators in specific metrics.
Established almost-sure convergence of the two-timescale Q-learning algorithm.
Provided finite-time convergence rates and analyzed challenges for the sublinear operator.
Abstract
Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied by Porteus (1975), we derive two Q-value-style extensions and show that the associated operators are contractions, each in its own metric (the sup-log/Thompson metric in the case of the second). We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning-style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a…
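To make the exponential-utility objective concrete, here is a minimal Python sketch of how it can be estimated from sampled discounted returns. It is not one of the paper's algorithms. It assumes the common entropic form (1/beta) * log E[exp(beta * G)] with a fixed nonzero risk parameter beta, where beta < 0 is risk-averse for rewards. The sign convention and the names `discounted_return` and `exponential_utility` are illustrative assumptions and may differ from the paper's notation.

```python
import math


def discounted_return(rewards, gamma):
    """Discounted sum of one reward sequence: sum_t gamma^t * r_t."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def exponential_utility(returns, beta):
    """Monte Carlo estimate of (1/beta) * log(mean(exp(beta * G))).

    Assumed convention (not taken from the paper): beta < 0 is
    risk-averse for rewards, beta > 0 is risk-seeking.
    """
    if beta == 0:
        raise ValueError("beta must be nonzero")
    scaled = [beta * g for g in returns]
    # Subtract the maximum before exponentiating to avoid overflow.
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / beta


if __name__ == "__main__":
    sampled_returns = [0.0, 2.0]  # two equally likely discounted returns
    print(exponential_utility(sampled_returns, beta=-1.0))  # below the mean of 1.0
    print(exponential_utility(sampled_returns, beta=1.0))   # above the mean of 1.0
```

Under this convention a deterministic return is valued at exactly its own value, while a spread in returns lowers the estimate for negative beta and raises it for positive beta.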
