Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies

Jinlin Lai; Lixin Zou; Jiaxing Song

arXiv:2011.14359 · cs.LG · December 1, 2020

TL;DR

This paper addresses the challenge of optimally combining estimators from multiple behavior policies in off-policy evaluation, proposing methods to reduce variance and improve estimation accuracy in reinforcement learning applications.

Contribution

The paper introduces three methods for reducing the variance of the mixture estimator formed by combining estimators from multiple behavior policies in off-policy evaluation.

Findings

01. The proposed methods reduce mean squared error in experiments on simulated recommender systems.

02. The methods apply when all sub-estimators are unbiased or asymptotically unbiased.

03. Experimental results demonstrate improved estimation accuracy.

Abstract

Off-policy evaluation is a key component of reinforcement learning that evaluates a target policy using offline data collected from behavior policies. It is a crucial step towards safe reinforcement learning and has been used in advertising, recommender systems and many other applications. In these applications, the offline data is sometimes collected from multiple behavior policies. Previous work treats data from different behavior policies equally. Nevertheless, some behavior policies are better at producing good estimators than others. This paper starts by discussing how to correctly mix estimators produced by different behavior policies. We propose three ways to reduce the variance of the mixture estimator when all sub-estimators are unbiased or asymptotically unbiased. Furthermore, experiments on simulated recommender systems show that our methods are effective in…
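To make the idea of mixture weights concrete, here is a minimal sketch of textbook inverse-variance weighting. It is not taken from the paper and is not claimed to be any of its three methods. It assumes the sub-estimators are independent and unbiased with known variances; under those assumptions, weights that sum to one keep the mixture unbiased, and weights proportional to 1/variance minimize the mixture variance. The function names and the example numbers are hypothetical.

```python
# Illustrative sketch only: textbook inverse-variance weighting, not the paper's method.
# Assumes independent, unbiased sub-estimators with known variances.

def inverse_variance_weights(variances):
    """Return weights proportional to 1/variance, normalized to sum to one."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [x / total for x in inverse]

def mix(weights, estimates):
    """Weighted combination of sub-estimates (unbiased if weights sum to one)."""
    return sum(w * e for w, e in zip(weights, estimates))

def mixture_variance(weights, variances):
    """Variance of the mixture when sub-estimators are independent."""
    return sum(w * w * v for w, v in zip(weights, variances))

# Hypothetical numbers: one estimate per behavior policy, with its variance.
estimates = [1.2, 0.7]
variances = [1.0, 4.0]

w = inverse_variance_weights(variances)
uniform = [0.5, 0.5]

print("weights:", w)
print("mixed estimate:", mix(w, estimates))
print("variance (inverse-variance weights):", mixture_variance(w, variances))
print("variance (uniform weights):", mixture_variance(uniform, variances))
```

In this toy setting the low-variance estimator receives most of the weight, and the weighted mixture has lower variance than the uniform one. In practice the variances are unknown and must be estimated from data, which this sketch does not address.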

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Smart Grid Energy Management