Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment
Zixian Yang, Xin Liu, Lei Ying

TL;DR
This paper introduces a new multi-armed bandit model incorporating user abandonment based on engagement, proposing algorithms with proven logarithmic regret and demonstrating superior performance over traditional methods.
Contribution
The paper develops the MAB-A model accounting for user abandonment and proposes ULCB and KL-ULCB algorithms with theoretical regret bounds and practical improvements.
Findings
Both algorithms achieve $O(\log K)$ regret.
KL-ULCB's regret bound is asymptotically sharp.
Algorithms outperform traditional UCB, KL-UCB, and Q-learning in simulations.
Abstract
Multi-armed bandit (MAB) is a classic model for understanding the exploration-exploitation trade-off. The traditional MAB model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok and YouTube Shorts, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage the users. To understand the exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A where "A" stands for abandonment and the abandonment probability depends on the current recommended item and the user's past experience (called state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic)…
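To make the abandonment mechanism described in the abstract concrete, here is a minimal Python sketch of a single user episode. It is a toy illustration under stated assumptions, not the paper's formal MAB-A definition and not ULCB or KL-ULCB. The binary state (whether the previous item was liked), the order of events within a step, and the names `run_episode`, `choose_arm`, `like_prob`, and `abandon_prob` are all hypothetical.

```python
import random


def run_episode(choose_arm, like_prob, abandon_prob, rng, max_steps=10_000):
    """Simulate one user episode in a toy bandit with abandonment.

    Assumptions (illustrative only, not taken from the paper):
      - state is 1 if the previous item was liked, else 0;
      - like_prob[arm] is the chance the user likes the item (reward 1);
      - abandon_prob[state][arm] is the chance the user leaves after
        seeing `arm` while in `state`.
    Returns the total reward collected before the user leaves.
    """
    state = 1  # assumption: a new user starts in the "satisfied" state
    total_reward = 0
    for _ in range(max_steps):
        arm = choose_arm(state)
        liked = 1 if rng.random() < like_prob[arm] else 0
        total_reward += liked
        # Abandonment depends on the recommended item and the current state.
        if rng.random() < abandon_prob[state][arm]:
            break
        state = liked  # the user's latest experience becomes the new state
    return total_reward


if __name__ == "__main__":
    rng = random.Random(0)
    like_prob = [0.3, 0.7]            # two hypothetical items
    abandon_prob = [[0.5, 0.2],       # state 0: previous item disliked
                    [0.2, 0.05]]      # state 1: previous item liked
    always_best = lambda state: 1     # fixed policy, no learning
    rewards = [run_episode(always_best, like_prob, abandon_prob, rng)
               for _ in range(1000)]
    print(sum(rewards) / len(rewards))
```

In this sketch a poor recommendation costs more than one lost reward: it also raises the chance that the episode ends, which is the engagement effect the abstract refers to.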
Taxonomy
Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques · Smart Grid Energy Management
