Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis
Jia Lin Hau, Erick Delage, Esther Derman, Mohammad Ghavamzadeh, Marek Petrik

TL;DR
This paper introduces a new Q-learning algorithm for quantile optimization in MDPs. The algorithm comes with strong convergence and performance guarantees, builds on a simple dynamic program (DP) decomposition that does not require known transition probabilities, and outperforms earlier algorithms in tabular domains.
Contribution
The paper presents a novel Q-learning algorithm for quantile MDPs with a simple DP decomposition that is model-free and computationally efficient.
Findings
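In tabular domains, the Q-learning algorithm converges to its DP variant.
The algorithm outperforms earlier algorithms in the same tabular domains.
The DP decomposition requires neither known transition probabilities nor solving complex saddle point equations.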
Abstract
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm leverages a new, simple dynamic program (DP) decomposition for quantile MDPs. Compared with prior work, our DP decomposition requires neither known transition probabilities nor solving complex saddle point equations and serves as a suitable foundation for other model-free RL algorithms. Our numerical results in tabular domains show that our Q-learning algorithm converges to its DP variant and outperforms earlier algorithms.
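To make the quantile objective concrete, the sketch below computes Value-at-Risk for a small discrete return distribution. It is an illustration only, not the paper's algorithm: the function name `value_at_risk`, the example distribution, and the lower-quantile convention inf{z : P[X <= z] >= alpha} are assumptions chosen here, and the paper may use a different quantile convention.

```python
def value_at_risk(outcomes, probs, alpha):
    """Lower alpha-quantile of a discrete random variable X.

    Assumed convention (not taken from the paper):
        VaR_alpha[X] = inf { z : P[X <= z] >= alpha },  0 < alpha <= 1.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if len(outcomes) != len(probs):
        raise ValueError("outcomes and probs must have the same length")

    # Sort outcomes in increasing order, carrying their probabilities along.
    pairs = sorted(zip(outcomes, probs))

    cumulative = 0.0
    for z, p in pairs:
        cumulative += p
        # Small tolerance guards against floating-point round-off in the sum.
        if cumulative >= alpha - 1e-12:
            return z
    # Reached only if probs sum to slightly less than alpha due to round-off.
    return pairs[-1][0]


# Hypothetical return distribution: a rare large loss, a common neutral
# outcome, and a moderate gain.
returns = [-10.0, 0.0, 5.0]
probabilities = [0.1, 0.6, 0.3]

mean_return = sum(z * p for z, p in zip(returns, probabilities))
var_low = value_at_risk(returns, probabilities, 0.05)
var_median = value_at_risk(returns, probabilities, 0.5)
```

Under this convention the 5% quantile picks out the rare loss while the expected return stays positive, which is the kind of preference a quantile objective captures and an expectation objective hides.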
Taxonomy
Topics: Advanced Control Systems Optimization · Fault Detection and Control Systems · Advanced Algorithms and Applications
Methods: Q-Learning
