Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning
Ye Shen, Hengrui Cai, Rui Song

TL;DR
This paper introduces a doubly robust interval estimation method, DREAM, for real-time evaluation of the optimal policy's value in online learning, addressing challenges like dependent data and exploration-exploitation trade-offs.
Contribution
The paper develops a novel doubly robust inference method, DREAM, that provides valid, asymptotically normal confidence intervals for the optimal policy value in online learning environments.
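As a rough illustration of the idea, the sketch below computes a generic doubly robust (augmented inverse-probability-weighted) value estimate with a normal-approximation interval on data logged by a toy epsilon-greedy bandit. It is not the paper's DREAM estimator: the function names, the two-armed Gaussian setup, the fixed exploration rate `eps`, and the plain sample-variance standard error are all assumptions made for illustration. The logging probability is known by design here, whereas the paper derives the probability of exploration for commonly used bandit algorithms and handles dependent data more carefully.

```python
import numpy as np


def dr_value_interval(actions, rewards, target_actions, propensities, mu_hat, z=1.96):
    """Generic doubly robust value estimate with a Wald-type interval.

    actions        : action actually taken at each step
    rewards        : observed outcome at each step
    target_actions : action the evaluated policy would take at each step
    propensities   : probability that the logging rule picks target_actions[t]
    mu_hat         : outcome-model prediction for target_actions[t]
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    target_actions = np.asarray(target_actions)
    propensities = np.asarray(propensities, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)

    # Model prediction plus an importance-weighted residual correction.
    match = (actions == target_actions).astype(float)
    scores = mu_hat + match / propensities * (rewards - mu_hat)

    value = scores.mean()
    # Simplifying assumption: plain sample standard error of the scores.
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    return value, (value - z * se, value + z * se)


def simulate_epsilon_greedy(T=2000, eps=0.2, true_means=(0.0, 1.0), seed=0):
    """Hypothetical toy bandit: Gaussian rewards, fixed-rate epsilon-greedy."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts, sums = np.zeros(K), np.zeros(K)
    actions = np.empty(T, dtype=int)
    target = np.empty(T, dtype=int)
    rewards, prop, mu_hat = np.empty(T), np.empty(T), np.empty(T)

    for t in range(T):
        # Running means use past data only, so they are fixed before step t.
        means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        greedy = int(np.argmax(means))
        a = int(rng.integers(K)) if rng.random() < eps else greedy
        r = true_means[a] + rng.normal()

        actions[t], rewards[t] = a, r
        target[t] = greedy              # current estimate of the optimal action
        prop[t] = 1 - eps + eps / K     # chance the logger takes the greedy arm
        mu_hat[t] = means[greedy]       # outcome-model prediction for that arm

        counts[a] += 1
        sums[a] += r

    return actions, rewards, target, prop, mu_hat


value, (lower, upper) = dr_value_interval(*simulate_epsilon_greedy())
print(f"estimated value: {value:.3f}, 95% interval: ({lower:.3f}, {upper:.3f})")
```

In this toy setup the better arm has mean reward 1.0, so the estimate should land near that value. The estimator stays consistent if either the propensities or the outcome model is correct, which is the sense in which it is "doubly robust".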
Findings
DREAM achieves valid confidence intervals in simulations.
The method also performs well in real-data applications.
It accounts for the dependence in sequentially collected data and for the probability of exploring non-optimal actions under commonly used bandit algorithms.
Abstract
Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, as it provides crucial guidance on early stopping of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real time. Yet, this problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid…
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics
