Learning from Logged Implicit Exploration Data
Alex Strehl, John Langford, Sham Kakade, Lihong Li

TL;DR
This paper develops a theoretical foundation for learning from logged implicit exploration data in contextual bandit problems, removing the need for explicit randomization or control during data collection.
Contribution
The paper introduces techniques for learning a policy from nonrandom logged data when the exploration policy is not explicitly known and no randomization was performed or recorded, which widens the range of historical data that can be used for offline learning.
Findings
Empirically verified on two reasonably sized real-world data sets from Yahoo!
Provides a sound and consistent theoretical foundation for using nonrandom exploration data
Extends offline policy learning to logged data where no randomization occurred or was logged
Abstract
We provide a sound and consistent foundation for the use of nonrandom exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which "offline" data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
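The abstract does not spell out the estimator, so the following is only an illustrative sketch under assumed details, not the authors' implementation. It shows one common way to approach this setting: estimate the unknown logging propensities from the data by empirical frequencies, then score a candidate policy with a clipped inverse-propensity estimate. The function names, the threshold `tau`, and the toy log are all hypothetical.

```python
from collections import Counter, defaultdict


def estimate_propensities(logs):
    """Estimate pi_hat(a | x) as the empirical frequency of action a in context x.

    `logs` is a list of (context, action, reward) tuples with hashable contexts.
    Assumption: contexts are discrete, so simple counting is enough here.
    """
    counts = defaultdict(Counter)
    for x, a, _ in logs:
        counts[x][a] += 1
    return {
        x: {a: c / sum(actions.values()) for a, c in actions.items()}
        for x, actions in counts.items()
    }


def clipped_ips_value(logs, policy, propensities, tau=0.05):
    """Clipped inverse-propensity estimate of the average reward of `policy`.

    Only logged events where the policy agrees with the logged action
    contribute; each is weighted by 1 / max(pi_hat(a | x), tau).
    The clipping threshold tau (hypothetical default) bounds the weights.
    """
    total = 0.0
    for x, a, r in logs:
        if policy(x) == a:
            total += r / max(propensities[x][a], tau)
    return total / len(logs)


# Hypothetical toy log of (context, action, reward) triples.
logs = [
    ("sports", "ad_A", 1.0),
    ("sports", "ad_A", 0.0),
    ("sports", "ad_B", 1.0),
    ("news", "ad_B", 1.0),
    ("news", "ad_B", 1.0),
    ("news", "ad_A", 0.0),
]

pi_hat = estimate_propensities(logs)
value_always_a = clipped_ips_value(logs, lambda x: "ad_A", pi_hat)
value_always_b = clipped_ips_value(logs, lambda x: "ad_B", pi_hat)
print(value_always_a, value_always_b)
```

Comparing the two estimates gives a way to rank candidate policies using only the historical log. With continuous features, the counting step would need to be replaced by a learned model of the logging behavior, which this sketch does not attempt.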
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Data Stream Mining Techniques
