Doubly Robust Policy Evaluation and Learning
Miroslav Dudik, John Langford, Lihong Li

TL;DR
This paper applies the doubly robust technique to policy evaluation and learning in contextual bandits, combining a model of rewards with a model of the past policy to reduce both the bias and the variance of value estimates.
Contribution
It applies the doubly robust technique to policy evaluation and optimization, pairing reward models (which suffer from large bias) with models of the past policy (which suffer from large variance) so that each offsets the weakness of the other.
Findings
The doubly robust approach reduces the variance of value estimates.
The method outperforms existing techniques in empirical tests.
It yields more accurate policy evaluation and better decision policies.
Abstract
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We…
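The contrast the abstract draws between reward-model estimators, past-policy estimators, and their doubly robust combination can be sketched in Python as follows. This is a minimal illustration of one common form of these estimators, not the authors' implementation. It assumes a deterministic target policy, and the function names, the array layout (`r_hat[i, a]` as the modeled reward for action `a` in context `i`), and the `propensities` argument (the estimated probability that the past policy chose the logged action) are hypothetical choices made for this example.

```python
import numpy as np


def direct_method(r_hat, target_actions):
    """Reward-model estimate: average modeled reward of the target policy's actions."""
    r_hat = np.asarray(r_hat, dtype=float)
    target_actions = np.asarray(target_actions)
    idx = np.arange(len(target_actions))
    return float(np.mean(r_hat[idx, target_actions]))


def inverse_propensity(rewards, logged_actions, target_actions, propensities):
    """Past-policy estimate: reweight logged rewards where the two policies agree."""
    rewards = np.asarray(rewards, dtype=float)
    propensities = np.asarray(propensities, dtype=float)  # assumed strictly positive
    match = (np.asarray(logged_actions) == np.asarray(target_actions)).astype(float)
    return float(np.mean(match * rewards / propensities))


def doubly_robust(rewards, logged_actions, target_actions, propensities, r_hat):
    """Doubly robust estimate: reward-model baseline plus a reweighted residual."""
    rewards = np.asarray(rewards, dtype=float)
    propensities = np.asarray(propensities, dtype=float)  # assumed strictly positive
    r_hat = np.asarray(r_hat, dtype=float)
    logged_actions = np.asarray(logged_actions)
    target_actions = np.asarray(target_actions)
    idx = np.arange(len(rewards))
    match = (logged_actions == target_actions).astype(float)
    # Baseline: modeled reward of the action the target policy would take.
    baseline = r_hat[idx, target_actions]
    # Correction: residual of the reward model on the logged action,
    # counted only where the logged action matches the target policy.
    correction = match * (rewards - r_hat[idx, logged_actions]) / propensities
    return float(np.mean(baseline + correction))
```

In this sketch, when the reward model is exactly right on the logged actions, the correction term vanishes and the doubly robust estimate equals the reward-model estimate. When the reward model is identically zero, the estimate reduces to the inverse propensity estimate.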
Taxonomy
Topics: Advanced Bandit Algorithms Research · Advanced Causal Inference Techniques · Healthcare Operations and Scheduling Optimization
