TL;DR
This paper introduces a selective importance sampling estimator (sIS) and an algorithm, POXM, for batch learning from bandit feedback in extremely large action spaces, and reports significant improvements over existing methods on recommendation-system tasks.
Contribution
The paper proposes the sIS estimator, which operates in a more favorable bias-variance regime than standard importance sampling, and the POXM algorithm. Together they enable more effective learning from bandit feedback in large-scale recommendation systems.
Findings
POXM outperforms baseline methods on XMC datasets.
The sIS estimator reduces variance in importance sampling.
POXM significantly improves reward estimation accuracy.
Abstract
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The…
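To make the variance problem mentioned in the abstract concrete, here is a minimal sketch of the standard (vanilla) importance sampling estimator for logged bandit data. It is not the paper's sIS estimator, whose definition is not given in this summary. The names `LoggedSample`, `is_estimate`, and `single_sample_variance`, as well as the toy setting (uniform logging policy, a target policy that always plays action 0, reward 1 only for action 0), are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# One logged interaction (hypothetical layout):
# (action taken, observed reward, logging-policy probability of that action).
LoggedSample = Tuple[int, float, float]


def is_estimate(logs: List[LoggedSample],
                target_prob: Callable[[int], float]) -> float:
    """Vanilla importance sampling estimate of the target policy's expected reward."""
    total = 0.0
    for action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_prob(action) / logging_prob
        total += weight * reward
    return total / len(logs)


def single_sample_variance(num_actions: int) -> float:
    """Exact variance of one importance-weighted term in an assumed toy setting.

    Toy setting: the logging policy is uniform over `num_actions` actions,
    the target policy always plays action 0, and only action 0 pays reward 1.
    """
    logging_prob = 1.0 / num_actions
    mean = 0.0
    second_moment = 0.0
    for action in range(num_actions):
        reward = 1.0 if action == 0 else 0.0
        weight = (1.0 if action == 0 else 0.0) / logging_prob
        term = weight * reward
        # Expectation is taken over the uniform logging policy.
        mean += logging_prob * term
        second_moment += logging_prob * term * term
    return second_moment - mean * mean
```

In this toy setting each weighted term has mean 1 and second moment equal to the number of actions K, so its variance is K - 1. The estimate stays unbiased, but the variance grows linearly with the size of the action space, which is the regime the abstract describes as impractical for millions of actions.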
Taxonomy
Methods: Pruning
