Learning from eXtreme Bandit Feedback

Romain Lopez; Inderjit S. Dhillon; Michael I. Jordan

arXiv:2009.12947·stat.ML·February 24, 2021

TL;DR

This paper introduces a selective importance sampling estimator (sIS) and a new algorithm, POXM, for learning from bandit feedback in extremely large action spaces. POXM outperforms baseline methods on XMC datasets used as recommendation-style tasks.

Contribution

The paper proposes the sIS estimator, which operates in a more favorable bias-variance regime than standard importance sampling, and the POXM algorithm. Together they enable more effective learning from bandit feedback in large-scale recommendation systems.

Findings

01. POXM outperforms baseline methods on XMC datasets.

02. The sIS estimator reduces variance in importance sampling.

03. POXM significantly improves reward estimation accuracy.

Abstract

We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The…
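The variance problem the abstract mentions can be illustrated with a small sketch of standard importance sampling (inverse propensity scoring) for logged bandit data. This is not the paper's sIS estimator or POXM; it only shows the baseline technique and why it degrades as the action set grows. The toy setup is an assumption made for illustration: a context-free problem, a uniform logging policy, a target policy that puts 0.9 of its mass on action 0, and Bernoulli rewards. The names `ips_estimate`, `weight_variance`, `toy_policies`, and `simulate` are hypothetical.

```python
import random


def ips_estimate(rewards, target_probs, logging_probs):
    """Importance-weighted average of logged rewards (standard IPS)."""
    weights = [t / l for t, l in zip(target_probs, logging_probs)]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def weight_variance(target, logging):
    """Exact variance of the importance weight under the logging policy.

    With actions drawn from `logging`, E[w] = 1 and
    E[w^2] = sum_a target(a)^2 / logging(a).
    """
    return sum(t * t / l for t, l in zip(target, logging)) - 1.0


def toy_policies(num_actions):
    """Assumed toy setup: uniform logging policy, target peaked on action 0."""
    logging = [1.0 / num_actions] * num_actions
    target = [0.9] + [0.1 / (num_actions - 1)] * (num_actions - 1)
    return target, logging


def simulate(num_actions, num_samples=10_000, seed=0):
    """Log data with the uniform policy, then evaluate the target policy."""
    rng = random.Random(seed)
    target, logging = toy_policies(num_actions)
    # Assumed reward model: action 0 pays off often, all others rarely.
    reward_means = [0.8] + [0.1] * (num_actions - 1)
    actions = rng.choices(range(num_actions), weights=logging, k=num_samples)
    rewards = [1.0 if rng.random() < reward_means[a] else 0.0 for a in actions]
    estimate = ips_estimate(
        rewards,
        [target[a] for a in actions],
        [logging[a] for a in actions],
    )
    true_value = sum(t * m for t, m in zip(target, reward_means))
    return estimate, true_value


if __name__ == "__main__":
    for k in (10, 1_000, 100_000):
        target, logging = toy_policies(k)
        estimate, true_value = simulate(k)
        print(k, weight_variance(target, logging), estimate, true_value)
```

In this toy setup the weight variance grows roughly linearly with the number of actions, so with a fixed log size the estimate becomes unreliable once the action set is large. That is the regime in which the abstract describes plain importance sampling as impractical.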

Taxonomy

Methods: Pruning