TL;DR
This paper introduces a novel bandit algorithm that predicts delayed rewards to optimize long-term user satisfaction in recommender systems, demonstrated through a podcast recommendation case study.
Contribution
It develops a Bayesian filter-based predictive model for delayed rewards and a bandit algorithm that leverages this model to improve long-term recommendation quality.
Findings
The proposed algorithm significantly outperforms methods that optimize short-term proxy rewards.
It balances exploration and exploitation effectively when feedback is delayed.
It improves long-term engagement in the podcast recommendation case study.
Abstract
Recommender systems are a ubiquitous feature of online platforms. Increasingly, they are explicitly tasked with increasing users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a multi-armed bandit problem with delayed rewards. We observe that there is an apparent trade-off in choosing the learning signal: Waiting for the full reward to become available might take several weeks, hurting the rate at which learning happens, whereas measuring short-term proxy rewards reflects the actual long-term goal only imperfectly. We address this challenge in two steps. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Full observations as well as partial (short or medium-term) outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit…
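
To make the two-step idea in the abstract concrete, here is a minimal sketch, not the paper's actual model. It assumes a scalar Gaussian belief over each arm's mean long-term reward, treats every observation as an independent noisy reading of that mean, and assigns partial (short-term) outcomes a larger noise variance than full ones. The names `Belief`, `update`, `NOISE_VAR`, and `thompson_pick`, the noise values, and the choice of Thompson sampling are all illustrative assumptions; the truncated abstract does not say which bandit strategy the authors use.

```python
import random
from dataclasses import dataclass


@dataclass
class Belief:
    """Gaussian belief over one arm's mean long-term reward (assumed form)."""
    mean: float
    var: float


# Hypothetical noise variances: a partial outcome is a noisier signal
# of the long-term reward than the fully observed reward.
NOISE_VAR = {"partial": 4.0, "full": 1.0}


def update(belief: Belief, obs: float, noise_var: float) -> Belief:
    """Scalar Gaussian (Kalman-style) measurement update."""
    gain = belief.var / (belief.var + noise_var)
    return Belief(
        mean=belief.mean + gain * (obs - belief.mean),
        var=(1.0 - gain) * belief.var,
    )


def thompson_pick(beliefs: list[Belief], rng: random.Random) -> int:
    """Sample one value per arm from its belief and pick the largest."""
    samples = [rng.gauss(b.mean, b.var ** 0.5) for b in beliefs]
    return max(range(len(beliefs)), key=samples.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    beliefs = [Belief(0.0, 1.0), Belief(0.0, 1.0)]
    # Arm 0 gets an early partial signal; arm 1 gets a full observation.
    beliefs[0] = update(beliefs[0], 2.0, NOISE_VAR["partial"])
    beliefs[1] = update(beliefs[1], 2.0, NOISE_VAR["full"])
    print(beliefs, thompson_pick(beliefs, rng))
```

In this sketch a partial outcome moves the belief right away but by less than a full observation would, which is one simple way to avoid waiting weeks for the full reward without trusting the short-term proxy completely.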
