Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR
Kaiwen Wang, Nathan Kallus, Wen Sun

TL;DR
This paper develops near-minimax-optimal algorithms for risk-sensitive reinforcement learning with CVaR, providing tight regret bounds in multi-arm bandits and tabular MDPs, and introduces novel bonus-driven methods.
Contribution
It introduces new algorithms with near-optimal regret bounds for CVaR-based RL: an Upper Confidence Bound algorithm with a Bernstein bonus for bandits and a bonus-driven Value Iteration procedure for tabular MDPs, improving on existing bounds.
Findings
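Shows that the minimax CVaR regret rate in bandits is $\Omega(\sqrt{\tau^{-1}AK})$, for risk tolerance $\tau$, $A$ actions, and $K$ episodes.
Establishes a regret lower bound of $\Omega(\sqrt{\tau^{-1}SAK})$ in tabular MDPs with $S$ states.
Proposes algorithms that attain near-optimal regret bounds under the CVaR risk measure.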
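To make "near-optimal" concrete, the short calculation below compares the general MDP upper bound reported in the paper's full abstract, $\widetilde{O}(\tau^{-1}\sqrt{SAK})$, with the $\Omega(\sqrt{\tau^{-1}SAK})$ lower bound. Logarithmic factors hidden by $\widetilde{O}$ are ignored here, so this is an illustration of the scaling and not a statement from the paper's proofs.

```latex
\[
\frac{\tau^{-1}\sqrt{SAK}}{\sqrt{\tau^{-1}SAK}}
  = \frac{\tau^{-1}}{\tau^{-1/2}}
  = \tau^{-1/2}
\]
```

The two bounds therefore differ by a factor of $\tau^{-1/2}$, which is a constant whenever $\tau$ is held constant. For example, at $\tau = 0.01$ the gap is a factor of 10.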
Abstract
In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $\Omega(\sqrt{\tau^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\widetilde{O}(\sqrt{\tau^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde…
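As a minimal sketch of the CVaR objective itself, the snippet below estimates lower-tail CVaR from a list of sampled returns using the standard variational form $\mathrm{CVaR}_\tau(X) = \sup_b \{ b - \tau^{-1}\mathbb{E}[(b - X)^+] \}$. The function name `empirical_cvar` and the sample data are illustrative assumptions. This is not the paper's algorithm, which additionally relies on exploration bonuses.

```python
def empirical_cvar(samples, tau):
    """Lower-tail CVaR at level tau of the empirical distribution of `samples`.

    Illustrative helper (hypothetical name), not code from the paper.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")

    n = len(samples)
    best = float("-inf")
    # The objective b - E[(b - X)^+] / tau is concave and piecewise linear in b,
    # with kinks only at sample values, so the supremum is attained at a sample.
    for b in samples:
        shortfall = sum(max(b - x, 0.0) for x in samples) / n
        best = max(best, b - shortfall / tau)
    return best


rewards = [0.0, 1.0, 2.0, 3.0]
print(empirical_cvar(rewards, 0.5))  # 0.5, the mean of the worst half
print(empirical_cvar(rewards, 1.0))  # 1.5, the plain mean
```

With $\tau = 1$ the estimate reduces to the ordinary mean, while smaller $\tau$ averages over only the worst $\tau$-fraction of outcomes. The loop is quadratic in the number of samples, which keeps the sketch short; a sort-based version would be faster.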
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization
