Doubly Robust Policy Evaluation and Learning

Miroslav Dudik; John Langford; Lihong Li

arXiv:1103.4601·cs.LG·May 9, 2011·303 cites

TL;DR

This paper applies the doubly robust technique to policy evaluation and learning in contextual bandits. It combines a model of rewards with a model of the past policy, so that the bias of the former and the variance of the latter are both reduced.

Contribution

It applies the doubly robust technique to policy evaluation and optimization, using a reward model together with a model of the past policy so that the strengths of each approach compensate for the weaknesses of the other.

Findings

01. The doubly robust approach reduces variance in value estimates.

02. The method outperforms existing techniques in empirical tests.

03. It achieves more accurate policy evaluation and better decision policies.

Abstract

We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We…
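
The abstract is truncated before it states the estimator, so the following is only a minimal sketch of the commonly used doubly robust form for a deterministic target policy: a reward-model prediction for the new policy's action, plus an importance-weighted correction on the samples where the logged action matches. All names (`dr_value`, `reward_hat`, `propensities`), the synthetic data, and the deliberately poor reward model are assumptions made for illustration, not the authors' code or experiments.

```python
import numpy as np

def dr_value(actions, rewards, propensities, target_actions, reward_hat):
    """Doubly robust value estimate for a deterministic target policy (sketch).

    actions:        (n,) logged actions
    rewards:        (n,) rewards observed for the logged actions
    propensities:   (n,) estimated probability of each logged action under the past policy
    target_actions: (n,) actions the new policy would take in each context
    reward_hat:     (n, k) reward-model predictions for every action
    """
    idx = np.arange(len(actions))
    model_term = reward_hat[idx, target_actions]            # reward-model part
    match = (actions == target_actions).astype(float)       # 1 where logged action agrees
    correction = match * (rewards - reward_hat[idx, actions]) / propensities
    return float(np.mean(model_term + correction))

# Synthetic contextual bandit (hypothetical): 3 actions, one scalar context.
rng = np.random.default_rng(0)
n, k = 20000, 3
x = rng.uniform(0.0, 1.0, size=n)
true_mean = np.column_stack([0.2 + 0.6 * x, np.full(n, 0.5), 0.8 - 0.6 * x])

# Past policy: uniform over actions, so every propensity is 1/k.
actions = rng.integers(0, k, size=n)
propensities = np.full(n, 1.0 / k)
idx = np.arange(n)
rewards = true_mean[idx, actions] + rng.normal(0.0, 0.1, size=n)

# New policy to evaluate: action 0 when x > 0.5, otherwise action 2.
target_actions = np.where(x > 0.5, 0, 2)
true_value = float(np.mean(true_mean[idx, target_actions]))

# A deliberately biased reward model that predicts 0.5 for everything.
reward_hat = np.full((n, k), 0.5)

dm_estimate = float(np.mean(reward_hat[idx, target_actions]))   # reward model only
dr_estimate = dr_value(actions, rewards, propensities, target_actions, reward_hat)

print("true value:", true_value)
print("reward model only:", dm_estimate)
print("doubly robust:", dr_estimate)
```

In this sketch the reward model alone inherits the model's bias, while the correction term pulls the doubly robust estimate back toward the true value because the propensities are correct. With a good reward model and poor propensities, the roles would be reversed.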

Code & Models

Repositories

leoguelman/BLBF (PyTorch)

Taxonomy

Topics: Advanced Bandit Algorithms Research · Advanced Causal Inference Techniques · Healthcare Operations and Scheduling Optimization