TL;DR
This paper develops a new algorithm for learning optimal policies from adaptively collected data. It addresses two challenges, dependence among samples and treatments that receive too few observations, and provides both theoretical guarantees and empirical validation.
Contribution
It introduces a generalized algorithm based on augmented inverse propensity weighting (AIPW) with variance control, and establishes minimax regret bounds for policy learning from adaptively collected data.
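For readers unfamiliar with AIPW, the sketch below shows the standard (textbook) AIPW score and a simple policy value estimate built from it. It is not the paper's generalized estimator and includes no variance control. The function names `aipw_scores` and `policy_value`, the array layout, and the assumption that the logging propensities at each time step are known are all illustrative choices, not taken from the paper.

```python
import numpy as np


def aipw_scores(actions, rewards, propensities, mu_hat):
    """Standard AIPW scores, one per (sample, arm); shape (T, K).

    actions:      (T,)   index of the arm actually assigned at each step
    rewards:      (T,)   observed outcome at each step
    propensities: (T, K) probability the logging rule gave each arm at step t
                         (assumed known; under adaptive collection these
                         vary with t)
    mu_hat:       (T, K) outcome-model predictions for every arm
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    propensities = np.asarray(propensities, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)

    T, K = mu_hat.shape
    # One-hot indicator of the arm that was actually pulled.
    indicator = np.zeros((T, K))
    indicator[np.arange(T), actions] = 1.0

    # Model prediction plus an inverse-propensity-weighted residual
    # correction, applied only to the observed arm.
    residual = rewards[:, None] - mu_hat
    return mu_hat + indicator * residual / propensities


def policy_value(scores, policy_actions):
    """Unweighted average of the scores of the arms a candidate policy picks."""
    scores = np.asarray(scores, dtype=float)
    policy_actions = np.asarray(policy_actions)
    return scores[np.arange(scores.shape[0]), policy_actions].mean()


# Tiny illustrative example with made-up numbers.
example_scores = aipw_scores(
    actions=[0, 1],
    rewards=[1.0, 2.0],
    propensities=[[0.5, 0.5], [0.25, 0.75]],
    mu_hat=np.zeros((2, 2)),
)
example_value = policy_value(example_scores, [0, 1])
```

When a propensity is small, the correction term `residual / propensities` becomes large, which is one way to see why estimators for adaptively collected data may need some form of variance control.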
Findings
The algorithm achieves minimax rate-optimal regret guarantees.
It effectively handles dependence in adaptively collected data.
Empirical results demonstrate improved policy learning performance.
Abstract
Learning optimal policies from historical data enables personalization in a wide variety of applications including healthcare, digital recommendations, and online education. The growing policy learning literature focuses on settings where the data collection rule stays fixed throughout the experiment. However, adaptive data collection is becoming more common in practice, from two primary sources: 1) data collected from adaptive experiments that are designed to improve inferential efficiency; 2) data collected from production systems that progressively evolve an operational policy to improve performance over time (e.g., contextual bandits). Yet adaptivity complicates ex post identification of the optimal policy, since samples are dependent, and each treatment may not receive enough observations for each type of individual. In this paper, we make initial research inquiries into addressing the…
