Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data

Sunil Madhow; Dan Qiao; Ming Yin; Yu-Xiang Wang

arXiv:2306.14063·cs.LG·May 2, 2024

Open Access

TL;DR

This paper develops theoretical guarantees for offline policy evaluation in reinforcement learning when the data is collected adaptively rather than as i.i.d. trajectories from a single logging policy. It shows that minimax-optimal offline learning can be recovered in this setting and studies the estimators empirically in simulation.

Contribution

The paper analyzes the TMIS (Tabular Marginalized Importance Sampling) Offline Policy Evaluation estimator under adaptive data collection in tabular MDPs, deriving high-probability, instance-dependent bounds on its estimation error and a minimax-optimality result.

Findings

01 High-probability, instance-dependent bounds on estimation error.

02 Recovery of minimax-optimal offline learning in adaptive settings.

03 Empirical analysis of estimator behavior under adaptive and non-adaptive data collection.

Abstract

Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Currently, most results hinge on unrealistic assumptions about the data distribution -- namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We consider a more general setting where the dataset may have been gathered adaptively. We develop theory for the TMIS Offline Policy Evaluation (OPE) estimator in this generalized setting for tabular MDPs, deriving high-probability, instance-dependent bounds on its estimation error. We also recover minimax-optimal offline learning in the adaptive setting. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive regimes.
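
For readers unfamiliar with the estimator, the following is a minimal sketch of a TMIS-style plug-in computation for a finite-horizon tabular MDP, not the authors' implementation. It assumes episodes of fixed length H stored as (state, action, reward, next_state) tuples, time-dependent empirical transition and reward estimates, and unvisited state-action pairs treated as contributing zero. The function name `tmis_estimate` and its argument layout are hypothetical. The computation pools counts over all episodes and does not depend on how they were logged; per the abstract, what adaptivity changes is the analysis of the resulting error.

```python
import numpy as np

def tmis_estimate(trajectories, pi, S, A, H):
    """Plug-in value estimate of target policy `pi` from logged episodes.

    trajectories: list of episodes, each a list of H (s, a, r, s_next) tuples
    pi: array of shape (H, S, A), pi[t, s, a] = target probability of a in s at step t
    """
    n_sas = np.zeros((H, S, A, S))   # transition counts per time step
    r_sum = np.zeros((H, S, A))      # summed rewards per time step
    d1 = np.zeros(S)                 # empirical initial-state distribution

    for ep in trajectories:
        d1[ep[0][0]] += 1
        for t, (s, a, r, s_next) in enumerate(ep):
            n_sas[t, s, a, s_next] += 1
            r_sum[t, s, a] += r
    d1 /= len(trajectories)

    n_sa = n_sas.sum(axis=-1)
    safe = np.maximum(n_sa, 1)       # avoids 0/0; unvisited pairs stay at zero
    P_hat = n_sas / safe[..., None]  # estimated P_t(s' | s, a)
    r_hat = r_sum / safe             # estimated mean reward r_t(s, a)

    # Roll the estimated marginal state distribution forward under pi.
    d = d1
    value = 0.0
    for t in range(H):
        sa = d[:, None] * pi[t]                   # state-action marginal, shape (S, A)
        value += (sa * r_hat[t]).sum()
        d = np.einsum("sa,sap->p", sa, P_hat[t])  # next-step state marginal
    return value
```

The forward recursion over state marginals is what distinguishes this family of estimators from per-trajectory importance sampling: importance weights are never multiplied along a trajectory, so the estimate does not pick up a product of H ratios.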

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control