Off-policy Learning for Multiple Loggers
Li He, Long Xia, Wei Zeng, Zhi-Ming Ma, Yihong Zhao, and Dawei Yin

TL;DR
This paper develops off-policy learning methods for scenarios with multiple historical data sources, providing theoretical analysis and algorithms that outperform existing approaches in benchmark tests.
Contribution
It introduces a novel off-policy learning framework for multiple loggers, including generalization error bounds and a constrained optimization algorithm.
Findings
Outperforms state-of-the-art methods on benchmark datasets
Provides theoretical generalization error bounds for multi-logger off-policy learning
Develops a minimax-based algorithm for the constrained optimization problem
Abstract
It is well known that historical logs are used for evaluating and learning policies in interactive systems, e.g., recommendation, search, and online advertising. Since direct online policy learning usually harms user experience, it is crucial to apply off-policy learning in real-world applications instead. Although some prior work exists, most of it focuses on learning from a single historical policy. In practice, however, a number of parallel experiments, e.g., multiple A/B tests, are usually performed simultaneously. To make full use of such historical data, learning policies from multiple loggers becomes necessary. Motivated by this, in this paper, we investigate off-policy learning when the training data come from multiple historical policies. Specifically, policies, e.g., neural networks, can be learned directly from multi-logger data, with counterfactual…
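To make the multi-logger setting concrete, here is a minimal sketch of a standard inverse-propensity-scoring (IPS) estimate that pools samples from several loggers. It is a textbook baseline for illustration only, not the estimator or algorithm proposed in the paper. The function name, the tuple layout of the logs, and the toy data are all assumptions introduced here.

```python
# Illustrative baseline only: pooled IPS over several loggers.
# This is NOT the paper's method; names and data layout are hypothetical.

def naive_multi_logger_ips(logs, target_policy):
    """Estimate the target policy's expected reward from multi-logger data.

    logs: list with one entry per logger; each entry is a list of
          (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability that logger assigned to the action.
    target_policy: callable (context, action) -> probability under the
          policy being evaluated.
    """
    total = 0.0
    count = 0
    for logger_samples in logs:
        for context, action, reward, logging_prob in logger_samples:
            # Importance weight corrects for the mismatch between the
            # logger that collected the sample and the target policy.
            weight = target_policy(context, action) / logging_prob
            total += weight * reward
            count += 1
    return total / count


def uniform_policy(context, action):
    # Hypothetical target policy: two actions, chosen uniformly at random.
    return 0.5


# Toy logs from two hypothetical loggers with different propensities.
logger_a = [("u1", 0, 1.0, 0.25), ("u2", 1, 0.0, 0.25)]
logger_b = [("u3", 0, 1.0, 0.5), ("u4", 1, 1.0, 0.5)]

estimate = naive_multi_logger_ips([logger_a, logger_b], uniform_policy)
print(estimate)
```

Samples from the logger with smaller propensities receive larger weights, so pooling loggers that behave very differently can inflate variance. That is one reason the multi-logger case calls for dedicated analysis.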
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Machine Learning and Algorithms
