Efficient Algorithms for Mitigating Uncertainty and Risk in Reinforcement Learning
Xihong Su

TL;DR
This dissertation introduces new algorithms and theoretical results for risk-averse reinforcement learning: policy optimization under model uncertainty, contraction conditions for risk-averse Bellman operators, and model-free Q-learning methods.
Contribution
It identifies a new connection between policy gradient and dynamic programming in MMDPs, establishes sufficient and necessary conditions for the exponential ERM Bellman operator to be a contraction, and develops model-free risk-averse Q-learning algorithms.
Findings
CADP guarantees monotone policy improvements to a local maximum.
Stationary deterministic optimal policies are proven to exist for ERM-TRC and EVaR-TRC.
Q-learning algorithms converge to risk-averse optimal policies.
Abstract
This dissertation makes three main contributions. First, we identify a new connection between policy gradient and dynamic programming in MMDPs and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm to compute a Markov policy that maximizes the discounted return averaged over the uncertain models. CADP adjusts model weights iteratively to guarantee monotone policy improvements to a local maximum. Second, we establish sufficient and necessary conditions for the exponential ERM Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for ERM-TRC and EVaR-TRC. We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies for ERM-TRC and EVaR-TRC. Third, we propose model-free Q-learning algorithms for computing policies with risk-averse objectives:…
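For readers unfamiliar with the risk measure behind ERM-TRC, the following is a minimal sketch of how an entropic risk value might be computed for a discrete random return. It assumes the common reward-based convention ERM_β[X] = -(1/β) log E[exp(-βX)] with risk-aversion parameter β > 0; the abstract does not state the convention, and the function name `entropic_risk` and the example numbers are hypothetical, not taken from the dissertation.

```python
import math


def entropic_risk(values, probs, beta):
    """Entropic risk of a discrete random return X (assumed convention):
    ERM_beta[X] = -(1/beta) * log E[exp(-beta * X)], with beta > 0.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Shift the exponents by their maximum (log-sum-exp trick) so that
    # exp() cannot overflow for large beta or large negative returns.
    exponents = [-beta * v for v in values]
    shift = max(exponents)
    total = sum(p * math.exp(e - shift) for p, e in zip(probs, exponents))
    return -(shift + math.log(total)) / beta


# Hypothetical example: a return of 0 or 10 with equal probability.
returns = [0.0, 10.0]
probs = [0.5, 0.5]
mean_return = sum(p * v for p, v in zip(probs, returns))

risk_averse_value = entropic_risk(returns, probs, beta=1.0)
near_neutral_value = entropic_risk(returns, probs, beta=1e-6)
```

Under this convention the entropic risk never exceeds the expected return, and it approaches the expected return as β tends to 0, which is why a larger β corresponds to a more risk-averse objective.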
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Game Theory and Applications
