Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage

Jose Blanchet; Miao Lu; Tong Zhang; Han Zhong

arXiv:2305.09659·cs.LG·August 23, 2023·2 cites

TL;DR

This paper introduces a novel double pessimism principle and a generic algorithm, $P^{2}MPO$, for distributionally robust offline reinforcement learning, demonstrating sample efficiency and tractability across various models and extending to robust Markov games.

Contribution

It proposes the first general double pessimism framework for robust offline RL, with provable efficiency and applicability to multiple RMDP types and robust Markov games.

Findings

1. $P^{2}MPO$ achieves an $\tilde{\mathcal{O}}(n^{-1/2})$ convergence rate in the offline dataset size $n$.

2. First tractability proofs for factored, kernel, and neural RMDPs.

3. Extension of double pessimism to robust Markov games for Nash equilibrium.

Abstract

In this paper, we study distributionally robust offline reinforcement learning (robust offline RL), which seeks to find an optimal policy purely from an offline dataset that can perform well in perturbed environments. Specifically, we propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization ($P^{2}MPO$), which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. Notably, the double pessimism principle is crucial to overcome the distributional shifts incurred by (i) the mismatch between the behavior policy and the target policies; and (ii) the perturbation of the nominal model. Under certain accuracy conditions on the model estimation subroutine, we prove that $P^{2}MPO$ is sample-efficient with robust partial coverage data, which only requires the offline data to have good coverage of…
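The two sources of pessimism named in the abstract can be illustrated with a small tabular sketch. This is not the paper's algorithm, only a toy instance under stated assumptions: a finite list `candidate_models` stands in for whatever set of plausible nominal models an estimation step might return, the uncertainty set around each nominal model is an R-contamination set of radius `rho` (chosen here because its worst case has a closed form), and policies are found by brute-force enumeration. All function and variable names are hypothetical.

```python
import itertools
import numpy as np


def robust_value(P, R, policy, rho, gamma=0.9, iters=500):
    """Robust value of a deterministic policy around nominal model P.

    Assumed uncertainty set (not from the paper): R-contamination,
    {(1 - rho) * P(.|s, a) + rho * q : q any distribution over states}.
    The adversary's best response is to put the mass rho on the
    lowest-value state, which gives the closed form used below.
    P has shape (S, A, S); R has shape (S, A); policy has shape (S,).
    """
    n_states = R.shape[0]
    idx = np.arange(n_states)
    V = np.zeros(n_states)
    for _ in range(iters):
        # Pessimism over perturbations of the nominal model.
        worst_next = (1 - rho) * (P[idx, policy] @ V) + rho * V.min()
        V = R[idx, policy] + gamma * worst_next
    return V


def doubly_pessimistic_policy(candidate_models, R, rho, init_state=0, gamma=0.9):
    """Maximize, over deterministic policies, the minimum robust value
    across all candidate nominal models (brute force, toy sizes only)."""
    n_states, n_actions = R.shape
    best_policy, best_value = None, -np.inf
    for actions in itertools.product(range(n_actions), repeat=n_states):
        policy = np.array(actions)
        # Pessimism over which nominal model is the true one.
        value = min(
            robust_value(P, R, policy, rho, gamma)[init_state]
            for P in candidate_models
        )
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value


# Toy problem: state 1 is absorbing with zero reward. In state 0,
# action 0 is safe (reward 0.5), action 1 pays 1.0 but model_b says
# it usually leads to the absorbing state.
R = np.array([[0.5, 1.0], [0.0, 0.0]])
model_a = np.zeros((2, 2, 2))
model_a[0, :, 0] = 1.0
model_a[1, :, 1] = 1.0
model_b = model_a.copy()
model_b[0, 1] = [0.1, 0.9]

policy_both, value_both = doubly_pessimistic_policy([model_a, model_b], R, rho=0.1)
policy_a, value_a = doubly_pessimistic_policy([model_a], R, rho=0.1)
```

In this toy problem, trusting `model_a` alone selects the higher-reward action in state 0, while taking the minimum over both candidate models selects the safe action, because the data cannot rule out `model_b`. The brute-force search is exponential in the number of states and is only meant to make the max-min-min structure visible.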

Videos

Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Energy, Environment, and Transportation Policies · Adaptive Dynamic Programming Control