Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning
Nathan Kallus, Masatoshi Uehara

TL;DR
This paper develops efficient off-policy evaluation methods for natural stochastic policies in offline reinforcement learning. Because these policies are defined as deviations from the behavior policy, the evaluation policy itself is unknown; the paper proposes estimators that attain the resulting efficiency bounds.
Contribution
It derives efficiency bounds for two types of natural stochastic policies, tilting policies and modified treatment policies, and introduces estimators that attain these bounds with partial double robustness.
Findings
Efficiency bounds for natural stochastic policies are established.
Proposed estimators achieve the efficiency bounds under mild conditions.
Estimators exhibit partial double robustness.
Abstract
We study the efficient off-policy evaluation of natural stochastic policies, which are defined in terms of deviations from the behavior policy. This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies. Crucially, offline reinforcement learning with natural stochastic policies can help alleviate issues of weak overlap, lead to policies that build upon current practice, and improve policies' implementability in practice. Compared with the classic case of a pre-specified evaluation policy, when evaluating natural stochastic policies, the efficiency bound, which measures the best-achievable estimation error, is inflated since the evaluation policy itself is unknown. In this paper, we derive the efficiency bounds of two major types of natural stochastic policies: tilting policies and modified treatment policies.…
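To make "defined in terms of deviations from the behavior policy" concrete, here is a minimal sketch of a tilting policy over a discrete action set. It assumes the tilted policy is proportional to u(a) times the behavior probability, which is one common way to write a tilting policy and is not spelled out in the abstract above. The function names `tilting_policy` and `density_ratio` and the example numbers are hypothetical.

```python
import numpy as np

def tilting_policy(behavior_probs, tilt):
    """Reweight behavior action probabilities by a nonnegative tilt u(a), then renormalize.

    Assumed form (not taken from the abstract): pi_e(a|s) is proportional to u(a) * pi_b(a|s).
    """
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    tilt = np.asarray(tilt, dtype=float)
    unnormalized = behavior_probs * tilt
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)

def density_ratio(behavior_probs, tilt):
    """Ratio pi_e(a|s) / pi_b(a|s); requires strictly positive behavior probabilities."""
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    return tilting_policy(behavior_probs, tilt) / behavior_probs

# Hypothetical behavior policy in one state, and a tilt that doubles the
# relative weight of the second and third actions.
pi_b = np.array([0.7, 0.2, 0.1])
u = np.array([1.0, 2.0, 2.0])

pi_e = tilting_policy(pi_b, u)
ratio = density_ratio(pi_b, u)
```

Under this assumed form the density ratio equals u(a) divided by the behavior-weighted average of u, so it stays bounded by max(u) / min(u) even for actions the behavior policy rarely takes. That is one way to read the abstract's remark about alleviating weak overlap. In practice the behavior probabilities would have to be estimated from data, which is why the evaluation policy is itself unknown.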
