Is Pessimism Provably Efficient for Offline RL?
Ying Jin, Zhuoran Yang, Zhaoran Wang

TL;DR
This paper introduces pessimistic value iteration (PEVI) for offline RL. A simple penalty term accounts for insufficient dataset coverage, so the algorithm is provably efficient without any coverage assumption and, in linear MDPs, matches the information-theoretic lower bound.
Contribution
It proposes an easily implementable pessimistic algorithm for offline RL whose theoretical guarantees match the lower bound in linear MDPs, which shows that pessimism is central to offline RL.
Findings
PEVI achieves a data-dependent upper bound on suboptimality for general MDPs, without assuming sufficient dataset coverage.
In linear MDPs, PEVI matches the information-theoretic lower bound.
Pessimism helps eliminate spurious correlations from irrelevant trajectories.
Abstract
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the…
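The penalty idea in the abstract can be sketched for the linear-MDP case as follows. This is a minimal illustration, not the authors' reference implementation: the function name `pevi_linear`, the dataset layout (one list of `(s, a, r, s_next)` tuples per step), the feature map `one_hot`, and the values of `beta` and `lam` are all assumptions made for this example. It fits a ridge regression backward in time and subtracts an uncertainty term `beta * sqrt(phi^T Lambda^{-1} phi)`, the same quantity that would be added as an exploration bonus in online RL.

```python
import numpy as np

def pevi_linear(dataset, phi, n_actions, H, d, beta, lam=1.0):
    """Pessimistic value iteration sketch with linear features.

    dataset[h] is a list of (s, a, r, s_next) transitions observed at step h
    (hypothetical layout); phi(s, a) returns a length-d feature vector.
    Rewards are assumed to lie in [0, 1].
    """
    w = [np.zeros(d) for _ in range(H)]
    Lambda_inv = [np.eye(d) / lam for _ in range(H)]

    def q_hat(h, s, a):
        f = phi(s, a)
        # Uncertainty quantifier: large where the data cover (s, a) poorly.
        penalty = beta * np.sqrt(f @ Lambda_inv[h] @ f)
        # Subtract the penalty (pessimism), then truncate to the valid range.
        return float(np.clip(f @ w[h] - penalty, 0.0, H - h))

    def v_hat(h, s):
        if h == H:
            return 0.0
        return max(q_hat(h, s, a) for a in range(n_actions))

    # Backward induction: ridge regression onto r + V_{h+1}(s_next).
    for h in reversed(range(H)):
        Lambda = lam * np.eye(d)
        target = np.zeros(d)
        for s, a, r, s_next in dataset[h]:
            f = phi(s, a)
            Lambda += np.outer(f, f)
            target += f * (r + v_hat(h + 1, s_next))
        Lambda_inv[h] = np.linalg.inv(Lambda)
        w[h] = Lambda_inv[h] @ target

    def policy(h, s):
        return int(np.argmax([q_hat(h, s, a) for a in range(n_actions)]))

    return policy, q_hat

def one_hot(s, a):
    # Toy feature map for a single state with two actions.
    f = np.zeros(2)
    f[a] = 1.0
    return f

# Action 0 is well covered (100 samples); action 1 was seen once with a lucky reward.
data = [[(0, 0, 0.5, 0)] * 100 + [(0, 1, 1.0, 0)]]
pessimistic, _ = pevi_linear(data, one_hot, n_actions=2, H=1, d=2, beta=1.0)
greedy, _ = pevi_linear(data, one_hot, n_actions=2, H=1, d=2, beta=0.0)
```

In this toy dataset, setting `beta=0.0` removes the penalty and the learned policy chases the single lucky sample of action 1, while a positive `beta` steers it toward the well-covered action 0. The scaling of `beta` used here is arbitrary; the paper's analysis ties it to the dimension, horizon, and sample size.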
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
