Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes
Juan Sebastian Rojas, Chi-Guhn Lee

TL;DR
This paper introduces RED reinforcement learning, a novel framework for solving multiple subtasks in average-reward MDPs, enabling risk-aware decision-making like CVaR optimization in an online setting.
Contribution
The work presents the first effective RL algorithms for average-reward MDPs that handle multiple subtasks and incorporate risk measures such as CVaR without complex optimization schemes.
Findings
Proposed RED algorithms are proven to converge in tabular cases.
Demonstrated online CVaR optimization in average-reward MDPs.
Unified framework for multiple subtasks in average-reward RL.
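As a hedged illustration of the risk measure named above: this page does not give the paper's exact CVaR definition, so the sketch below assumes the common lower-tail CVaR of a reward sample at level tau, and checks it against the Rockafellar-Uryasev form, max over b of b - E[(b - R)^+] / tau. The outer maximization over b wrapped around an inner expectation is the kind of bi-level structure the reviews refer to. The function names and the sample data are hypothetical, and none of this is the authors' algorithm.

```python
def cvar_tail_mean(rewards, tau):
    """Mean of the worst tau-fraction of rewards (assumes tau * n is an integer)."""
    ordered = sorted(rewards)
    k = round(tau * len(ordered))
    return sum(ordered[:k]) / k


def cvar_rockafellar_uryasev(rewards, tau):
    """Maximize b - E[(b - R)^+] / tau over candidate thresholds b.

    The objective is concave and piecewise linear in b with kinks at the
    sample values, so checking each sample value as a candidate is enough.
    """
    n = len(rewards)

    def objective(b):
        expected_shortfall = sum(max(b - r, 0.0) for r in rewards) / n
        return b - expected_shortfall / tau

    return max(objective(b) for b in rewards)


# Hypothetical reward sample: 1.0, 2.0, ..., 10.0
sample = [float(i) for i in range(1, 11)]
tail = cvar_tail_mean(sample, tau=0.2)
ru = cvar_rockafellar_uryasev(sample, tau=0.2)
print(tail, ru)  # the two estimates agree up to floating-point error
```

Both routes give the mean of the two smallest rewards here; the second one makes the extra optimization variable b explicit.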
Abstract
Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL) settings, with the majority of RL-based efforts having been allocated to discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various learning objectives, or subtasks, simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including proven-convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes,…
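The abstract is cut off before any update rule appears, so the following is not the RED algorithm. It is a minimal sketch of standard differential TD-learning for average-reward prediction, the style of algorithm the reviews below say the RED proofs extend. The two-state deterministic reward process, the step sizes, and all names are assumptions chosen so the result is easy to check by hand.

```python
def differential_td(next_state, reward, steps, alpha=0.1, eta=1.0):
    """Tabular differential TD-learning on a deterministic Markov reward process.

    next_state[s] is the successor of state s and reward[s] is the reward
    received when leaving s (both hypothetical toy inputs).
    """
    values = [0.0] * len(next_state)  # differential value estimates
    avg_reward = 0.0                  # running estimate of the average reward
    s = 0
    for _ in range(steps):
        s_next = next_state[s]
        # The TD error subtracts the average-reward estimate instead of discounting.
        delta = reward[s] - avg_reward + values[s_next] - values[s]
        values[s] += alpha * delta
        avg_reward += eta * alpha * delta
        s = s_next
    return values, avg_reward


# Toy cycle: state 0 -> state 1 with reward 1, state 1 -> state 0 with reward 3.
values, avg_reward = differential_td(next_state=[1, 0], reward=[1.0, 3.0], steps=2000)
value_gap = values[1] - values[0]
print(avg_reward, value_gap)
```

On this toy cycle the true average reward is 2 and the differential values satisfy v(1) - v(0) = 1, which the estimates approach. Note that the same TD error drives both the value update and the average-reward update.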
Peer Reviews
Decision · Submitted to ICLR 2025
The paper studies an interesting problem in average-reward RL, leveraging a structural property that is specific to average-reward MDPs. The introduced framework appears interesting in its generic form, although it is presented at a rather high and abstract level. I found its application to CVaR RL quite interesting. In addition, the fact that it removes the need to solve bi-level optimization problems explicitly is definitely a plus. The paper is mostly well-organized and w
Main Comments:
- One main comment concerns the assumption. In view of the statements in lines 123-124, it appears to me that a unichain assumption is effectively made for both prediction and control.
- As a weak aspect, the presented framework is only shown to enjoy asymptotic convergence (in the tabular case).
- Regarding CVaR RL, the use of an augmented state space is mentioned as a standard technique. Of course, it is clear that we lack interest in extending the state space – especially if there
(a) The abstract, introduction, and preliminaries on average reward reinforcement learning are well-written and clearly presented.
(b) The TD and Q-learning with stochastic approximation algorithms, along with Theorems 4.1–4.3, appear to be rigorously verified with proofs in Appendix B. These proofs effectively extend the results from [1, 2, 3] to the multi-subtasks setting proposed in this work.
References:
[1] Wan, Yi, Abhishek Naik, and Richard S. Sutton. "Learning and planning in average
Despite the authors' in-depth understanding of stochastic approximation and model-free Q-learning proofs, the paper lacks sufficient validation regarding the extension to risk awareness in average reward MDPs. (a) The paper demonstrates a limited engagement with prior work and foundational concepts in risk-averse CVaR MDPs. The authors inaccurately claim that “our work is the first to propose an MDP-based CVaR optimization algorithm that does not require an explicit bi-level optimization scheme
- The paper is easy to read, and the writing skills are good.
- I am unaware of previous work that proposed CVaR optimization for the average reward criterion, so this is original (as far as I can tell).
I reviewed this paper for another venue where reviewers voted for rejection unanimously. There have not been any substantial updates since that submission, so my concerns apply to this one. I copy below the most critical concerns I had then:
- Even in the risk-neutral case, the average reward criterion has some analytical advantages. Notably, [2] focuses on that same criterion rather than the discounted one. As a side comment, I am unaware of any provably convergent AC algorithms for the risk
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
