TL;DR
This paper introduces a doubly robust method for policy evaluation and optimization in contextual bandits. It combines a model of rewards with a model of the past policy, so that the value estimate avoids the large bias of relying on the reward model alone and the large variance of relying on the past-policy model alone.
Contribution
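The paper applies doubly robust estimation to policy evaluation and optimization in contextual bandits, where the reward is observed only for the action actually taken. The goal is to keep the strengths of reward-model approaches (low variance) and past-policy-model approaches (low bias) while offsetting the weakness of each.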
Findings
The doubly robust method gives more accurate value estimates than the existing techniques it is compared with.
It achieves lower variance in value estimation.
It leads to better policy performance in empirical tests.
Abstract
We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of observed contexts and the chosen action by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications such as health care, content recommendation and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strengths and overcome the weaknesses of the two approaches by applying the doubly robust estimation technique…
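The three families of estimators contrasted in the abstract can be sketched side by side. The following is a minimal illustration, not the authors' implementation: it assumes a deterministic target policy, logged propensities that are known exactly, and a toy simulated environment. All function names (`simulate_logs`, `direct_method`, `ips_estimate`, `dr_estimate`) and the environment itself are hypothetical.

```python
import random

# Hypothetical toy setup for illustration only; not taken from the paper.

def simulate_logs(n, seed=0):
    """Log n rounds from a uniform-random past policy over 3 actions."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        x = rng.randrange(3)             # observed context
        a = rng.randrange(3)             # action chosen by the past policy
        p = 1.0 / 3.0                    # probability the past policy gave to a
        mean = 0.8 if a == x else 0.2    # true expected reward (unknown in practice)
        r = 1.0 if rng.random() < mean else 0.0
        logs.append((x, a, r, p))        # reward is seen only for the logged action
    return logs

def direct_method(logs, target_policy, reward_model):
    """Reward-model approach: trust the model's prediction for the target action."""
    return sum(reward_model(x, target_policy(x)) for x, _, _, _ in logs) / len(logs)

def ips_estimate(logs, target_policy):
    """Past-policy approach: reweight matching logged rewards by 1 / propensity."""
    total = 0.0
    for x, a, r, p in logs:
        if target_policy(x) == a:
            total += r / p
    return total / len(logs)

def dr_estimate(logs, target_policy, reward_model):
    """Doubly robust: model prediction plus a propensity-weighted residual correction."""
    total = 0.0
    for x, a, r, p in logs:
        estimate = reward_model(x, target_policy(x))
        if target_policy(x) == a:
            # Correct the model using the observed reward when the actions agree.
            estimate += (r - reward_model(x, a)) / p
        total += estimate
    return total / len(logs)

if __name__ == "__main__":
    logs = simulate_logs(20000, seed=1)
    target = lambda x: x                 # new policy to evaluate; true value is 0.8 here
    crude_model = lambda x, a: 0.5       # deliberately biased reward model
    print("direct method:", direct_method(logs, target, crude_model))
    print("IPS:          ", ips_estimate(logs, target))
    print("doubly robust:", dr_estimate(logs, target, crude_model))
```

In this toy setup the direct method simply returns the biased model's constant prediction, whereas the doubly robust estimate should land near the true value of 0.8 because the propensities are correct. Real logged data would additionally require estimating the reward model and, often, the propensities.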
