Optimistic Reinforcement Learning with Quantile Objectives

Mohammad Alipour-Vaezi; Huaiyang Zhong; Kwok-Leung Tsui; Sajad Khodadadian

arXiv:2511.09652·cs.LG·November 14, 2025

Open Access

TL;DR

This paper introduces UCB-QRL, an optimistic reinforcement learning algorithm designed to optimize quantile-based objectives in finite-horizon MDPs, addressing risk sensitivity in RL applications like healthcare and finance.

Contribution

The paper develops a novel algorithm for risk-sensitive RL that optimizes quantile objectives and provides theoretical regret bounds in finite-horizon MDPs.

Findings

01

UCB-QRL achieves a high-probability regret bound of order $O\left((2/\kappa)^{H+1} H \sqrt{SATH \log(2SATH/\delta)}\right)$.
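To get a feel for how this bound scales, the sketch below evaluates its leading term with all hidden constants dropped. It assumes the garbled expression reads $(2/\kappa)^{H+1} H \sqrt{SATH \log(2SATH/\delta)}$, that $\kappa > 0$ and $0 < \delta < 1$, and that $S$, $A$, $T$, $H$ are positive integers (presumably the numbers of states, actions, episodes, and the horizon). The function name `regret_bound_scale` is hypothetical and is not taken from the paper.

```python
import math


def regret_bound_scale(S, A, T, H, kappa, delta):
    """Leading term (2/kappa)^(H+1) * H * sqrt(SATH * log(2*SATH/delta)).

    Constants hidden by the O(.) notation are omitted, so the value is only
    meaningful for comparing how the bound scales across settings.
    """
    if kappa <= 0:
        raise ValueError("kappa is assumed to be strictly positive")
    if not 0 < delta < 1:
        raise ValueError("delta is assumed to lie in (0, 1)")
    n = S * A * T * H
    return (2.0 / kappa) ** (H + 1) * H * math.sqrt(n * math.log(2 * n / delta))


if __name__ == "__main__":
    # Halving kappa multiplies the term by 2^(H+1), independent of S, A, T.
    base = regret_bound_scale(S=5, A=2, T=1000, H=3, kappa=1.0, delta=0.05)
    half = regret_bound_scale(S=5, A=2, T=1000, H=3, kappa=0.5, delta=0.05)
    print(half / base)
```

Under this reading, the factor $(2/\kappa)^{H+1}$ grows exponentially in $H$ whenever $\kappa < 2$, while the dependence on $T$ stays at the usual square-root rate up to the logarithm.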

02

The algorithm effectively incorporates risk sensitivity through quantile optimization.

03

Theoretical analysis demonstrates that the regret bound depends on the problem's quantile sensitivity constant $\kappa$.

Abstract

Reinforcement Learning (RL) has achieved tremendous success in recent years. However, the classical foundations of RL do not account for the risk sensitivity of the objective function, which is critical in various fields, including healthcare and finance. A popular approach to incorporate risk sensitivity is to optimize a specific quantile of the cumulative reward distribution. In this paper, we develop UCB-QRL, an optimistic learning algorithm for the $\tau$-quantile objective in finite-horizon Markov decision processes (MDPs). UCB-QRL is an iterative algorithm in which, at each iteration, we first estimate the underlying transition probability and then optimize the quantile value function over a confidence ball around this estimate. We show that UCB-QRL yields a high-probability regret bound $O\left((2/\kappa)^{H+1} H \sqrt{SATH \log(2SATH/\delta)}\right)$ in the episodic…
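The ingredients named in the abstract can be illustrated with a minimal sketch: an empirical transition estimate, a confidence radius around it, and an empirical $\tau$-quantile of sampled cumulative rewards. Everything here is an assumption made for illustration. The function names are hypothetical, the L1 radius is a standard Weissman-style concentration bound rather than the radius used by UCB-QRL, and the optimization of the quantile value function over the confidence ball is not implemented.

```python
import math
from collections import defaultdict


def estimate_transitions(samples, num_states):
    """Empirical next-state frequencies from (state, action, next_state) triples.

    Returns (p_hat, visits), both keyed by (state, action).
    """
    counts = defaultdict(lambda: [0] * num_states)
    for s, a, s_next in samples:
        counts[(s, a)][s_next] += 1
    p_hat, visits = {}, {}
    for key, row in counts.items():
        n = sum(row)
        visits[key] = n
        p_hat[key] = [c / n for c in row]
    return p_hat, visits


def l1_radius(n, num_states, delta):
    """Illustrative L1 confidence radius (Weissman-style), NOT the paper's choice.

    With probability at least 1 - delta, the L1 distance between the true and
    empirical next-state distributions after n visits is below this value.
    """
    return math.sqrt(2.0 * (num_states * math.log(2.0) + math.log(1.0 / delta)) / n)


def empirical_quantile(returns, tau):
    """Lower empirical tau-quantile: smallest x whose empirical CDF is >= tau."""
    xs = sorted(returns)
    k = max(1, math.ceil(tau * len(xs)))
    return xs[k - 1]


if __name__ == "__main__":
    samples = [(0, 0, 1), (0, 0, 1), (0, 0, 0), (1, 0, 1)]
    p_hat, visits = estimate_transitions(samples, num_states=2)
    print(p_hat[(0, 0)], l1_radius(visits[(0, 0)], 2, delta=0.05))
    print(empirical_quantile([3.0, 1.0, 2.0, 4.0], tau=0.5))
```

The radius shrinks at rate $1/\sqrt{n}$ in the visit count, which is the mechanism that lets an optimistic method tighten its model set as data accumulates.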

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Risk and Portfolio Optimization