Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
Yang Xu, Swetha Ganesh, Vaneet Aggarwal

TL;DR
This paper develops non-asymptotic convergence guarantees for robust $Q$-learning and actor-critic algorithms in average reward MDPs under distributional uncertainties, enabling efficient robust policy learning.
Contribution
It introduces a novel contraction property of the robust $Q$ operator and provides sample-efficient algorithms for robust policy optimization under TV and Wasserstein uncertainties.
Findings
The optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm.
Algorithms achieve $\tilde{O}(\epsilon^{-2})$ sample complexity.
Numerical simulations demonstrate effectiveness.
Abstract
We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{O}(\epsilon^{-2})$ samples. We provide numerical simulations…
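To make the contraction idea concrete, here is a minimal model-based sketch in Python. It is not the paper's algorithm: it assumes a known nominal kernel, covers only the contamination uncertainty set, and uses the standard span semi-norm as a stand-in for the paper's "carefully designed" semi-norm, which is not specified in this summary. The helper names (`span`, `robust_backup`, `robust_relative_q_iteration`, `random_mdp`) and the contamination level `delta` are illustrative assumptions. For a contamination set, the worst-case expected next value is (1 - delta) times the nominal expectation plus delta times the minimum state value, so the backup shrinks the span of the difference between two $Q$ tables by a factor of at most 1 - delta.

```python
import numpy as np


def span(x):
    """Span semi-norm: max minus min, which is zero exactly on constant arrays."""
    return float(np.max(x) - np.min(x))


def robust_backup(Q, P, R, delta):
    """Robust average-reward backup under a contamination set (assumed model).

    Q: (S, A) table, P: (S, A, S) nominal kernel, R: (S, A) rewards.
    The gain is not subtracted here; the caller handles normalization.
    """
    V = Q.max(axis=1)                      # greedy state values, shape (S,)
    nominal = P @ V                        # nominal expected next value, shape (S, A)
    # Worst case over the contamination set: the adversary moves a mass
    # of delta onto the lowest-value state.
    worst = (1.0 - delta) * nominal + delta * V.min()
    return R + worst


def robust_relative_q_iteration(P, R, delta, iters=500):
    """Relative value iteration on Q: subtract a reference entry each step."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        TQ = robust_backup(Q, P, R, delta)
        Q = TQ - TQ[0, 0]                  # pin the constant offset at entry (0, 0)
    gain = robust_backup(Q, P, R, delta)[0, 0]   # Q[0, 0] == 0 after normalization
    return Q, gain


def random_mdp(S, A, seed=0):
    """Hypothetical test instance: Dirichlet transitions, uniform rewards."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S), rows sum to 1
    R = rng.uniform(0.0, 1.0, size=(S, A))
    return P, R


if __name__ == "__main__":
    P, R = random_mdp(S=5, A=3)
    Q, gain = robust_relative_q_iteration(P, R, delta=0.1)
    residual = span(robust_backup(Q, P, R, 0.1) - Q)
    print("estimated robust gain:", gain)
    print("span of Bellman residual:", residual)
```

Subtracting the reference entry `TQ[0, 0]` is one common way to fix the constant offset that the semi-norm ignores; a sample-based version would replace the exact expectation `P @ V` with stochastic approximation updates, which this sketch does not attempt.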
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Risk and Portfolio Optimization
