Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

Andrzej Ruszczynski; Tiangang Zhang

arXiv:2605.00654·cs.LG·May 4, 2026

TL;DR

This paper introduces mini-batch Markov coherent risk measures and multipattern risk-averse problems, and develops a feature-based Q-learning method with a high-probability regret bound and an economical variant, demonstrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.

Contribution

It proposes a novel class of risk measures and a Q-learning algorithm with theoretical guarantees for risk-averse Markov decision processes.

Findings

01

High-probability regret bound of O(H^2 N^H √K) for the proposed method.

02

Economical Q-learning variant streamlining policy evaluation.

03

Empirical validation on stochastic assignment and multi-armed bandit problems.

Abstract

For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in a feature-based $Q$-learning method with multipattern $Q$-factor approximation and we prove a high-probability regret bound of $O(H^{2} N^{H} \sqrt{K})$, where $H$ is the horizon, $N$ is the mini-batch size, and $K$ is the number of episodes. We also propose an economical version of the $Q$-learning method that streamlines the policy evaluation (backward) step. The theoretical results are illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.
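
The abstract does not spell out how a mini-batch measure is defined. As an illustration only, the sketch below assumes one plausible reading: the risk of a random cost is the expected value of an empirical coherent risk measure evaluated on a mini-batch of $N$ independent samples. The mean-upper-semideviation measure, the Monte Carlo averaging, and the names `mean_semideviation`, `mini_batch_risk`, `kappa`, and `repeats` are all assumptions made for this example and are not taken from the paper.

```python
import random
from statistics import fmean


def mean_semideviation(costs, kappa=0.5):
    """Empirical mean-upper-semideviation of order 1.

    Returns mean(Z) + kappa * mean(max(Z - mean(Z), 0)) over the given samples.
    This measure is coherent for kappa in [0, 1]; it is an illustrative choice,
    not necessarily the measure used in the paper.
    """
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1]")
    costs = list(costs)
    mu = fmean(costs)
    upper_dev = fmean([max(z - mu, 0.0) for z in costs])
    return mu + kappa * upper_dev


def mini_batch_risk(sample_cost, n, kappa=0.5, repeats=2000, seed=0):
    """Monte Carlo estimate of an assumed mini-batch risk.

    Draws `repeats` mini-batches of size `n` from `sample_cost(rng)`, evaluates
    the empirical risk of each batch, and averages the results. This averaging
    scheme is a hypothetical reading of "mini-batch measure".
    """
    rng = random.Random(seed)
    batch_risks = []
    for _ in range(repeats):
        batch = [sample_cost(rng) for _ in range(n)]
        batch_risks.append(mean_semideviation(batch, kappa))
    return fmean(batch_risks)


if __name__ == "__main__":
    # Two-point batch: mean 1.0, upper semideviation 0.5, so 1.0 + 0.5 * 0.5.
    print(mean_semideviation([0.0, 2.0], kappa=0.5))  # 1.25
    # Risk of a uniform(0, 1) cost under mini-batches of size 8.
    print(mini_batch_risk(lambda rng: rng.random(), n=8))
```

Under this reading, the mini-batch size `n` controls how much of the cost distribution's upper tail each evaluation sees, which is one way to picture why $N$ enters the regret bound.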
