Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage
Jose Blanchet, Miao Lu, Tong Zhang, Han Zhong

TL;DR
This paper introduces a novel double pessimism principle and a generic algorithm, P^2MPO, for distributionally robust offline reinforcement learning, demonstrating sample efficiency and tractability across various models and extending to robust Markov games.
Contribution
It proposes the first general double pessimism framework for robust offline RL, with provable efficiency and applicability to multiple RMDP types and robust Markov games.
Findings
P^2MPO achieves an \tilde{O}(n^{-1/2}) convergence rate.
First tractability proofs for factored, kernel, and neural RMDPs.
Extension of double pessimism to robust Markov games for Nash equilibrium.
Abstract
In this paper, we study distributionally robust offline reinforcement learning (robust offline RL), which seeks to find, purely from an offline dataset, an optimal policy that performs well in perturbed environments. Specifically, we propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization (P^2MPO), which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. Notably, the double pessimism principle is crucial to overcome the distributional shifts incurred by (i) the mismatch between the behavior policy and the target policies; and (ii) the perturbation of the nominal model. Under certain accuracy conditions on the model estimation subroutine, we prove that P^2MPO is sample-efficient with robust partial coverage data, which only requires the offline data to have good coverage of…
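The two sources of pessimism named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a small tabular finite-horizon problem with known rewards, a confidence set handed over as a finite list of candidate nominal kernels (standing in for the output of the model estimation subroutine), an R-contamination uncertainty set of radius rho around each nominal kernel, and brute-force enumeration of deterministic stationary policies. The names robust_value, doubly_pessimistic_policy, confidence_set, and rho are hypothetical.

```python
import itertools
import numpy as np

def robust_value(P, r, policy, horizon, rho, s0=0):
    """Worst-case value of a deterministic stationary policy.

    Assumption: the uncertainty set around the nominal kernel P (shape S x A x S)
    is an R-contamination set {(1 - rho) * P + rho * q : q any distribution},
    so the inner infimum has a closed form.
    """
    n_states = r.shape[0]
    idx = np.arange(n_states)
    V = np.zeros(n_states)
    for _ in range(horizon):
        # Pessimism (ii): worst case over perturbations of the nominal model.
        worst_next = (1 - rho) * P[idx, policy] @ V + rho * V.min()
        V = r[idx, policy] + worst_next
    return V[s0]

def doubly_pessimistic_policy(confidence_set, r, horizon, rho):
    """Maximize, over policies, the minimum robust value over candidate models."""
    n_states, n_actions = r.shape
    best_policy, best_value = None, -np.inf
    for actions in itertools.product(range(n_actions), repeat=n_states):
        policy = np.array(actions)
        # Pessimism (i): worst case over nominal models consistent with the data.
        pess = min(robust_value(P, r, policy, horizon, rho) for P in confidence_set)
        if pess > best_value:
            best_policy, best_value = policy, pess
    return best_policy, best_value

# Toy instance: 2 states, 2 actions, 3 candidate nominal kernels (all made up).
rng = np.random.default_rng(0)
r = rng.uniform(size=(2, 2))
confidence_set = [rng.dirichlet(np.ones(2), size=(2, 2)) for _ in range(3)]
policy, value = doubly_pessimistic_policy(confidence_set, r, horizon=3, rho=0.1)
print(policy, value)
```

Enumerating policies and candidate models only works at toy scale; the sketch is meant to show the max-min-min structure, not how the optimization would be carried out in the settings the paper covers.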
Taxonomy
Topics: Reinforcement Learning in Robotics · Energy, Environment, and Transportation Policies · Adaptive Dynamic Programming Control
