Incentivized Bandit Learning with Self-Reinforcing User Preferences
Tianchen Zhou, Jia Liu, Chaosheng Dong, Jingyuan Deng

TL;DR
This paper introduces a novel multi-armed bandit model that incorporates user incentives and self-reinforcing preferences, proposing policies with logarithmic regret and payment bounds, validated through simulations.
Contribution
The paper presents a new MAB model that accounts for incentives and self-reinforcing user preferences, along with two policies that achieve logarithmic bounds on both expected regret and expected payment.
Findings
Both policies achieve $O(\log T)$ expected regret.
Expected payment is also $O(\log T)$ over a time horizon $T$.
Simulations confirm robustness and effectiveness of the policies.
Abstract
In this paper, we investigate a new multi-armed bandit (MAB) online learning model that considers real-world phenomena in many recommender systems: (i) the learning agent cannot pull the arms by itself and thus has to offer rewards to users to incentivize arm-pulling indirectly; and (ii) if users with specific arm preferences are well rewarded, they induce a "self-reinforcing" effect in the sense that they will attract more users of similar arm preferences. Besides addressing the tradeoff of exploration and exploitation, another key feature of this new MAB model is to balance reward and incentivizing payment. The goal of the agent is to maximize the total reward over a fixed time horizon with a low total payment. Our contributions in this paper are two-fold: (i) We propose a new MAB model with random arm selection that considers the relationship of users' self-reinforcing…
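The setting described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's model or either of its policies: the function `simulate`, the additive `bonus`, the urn-style weight update, and the explore-then-incentivize schedule are all assumptions chosen for illustration. It only shows the two ingredients named above: the agent cannot pull arms itself and instead pays to tilt a user's random choice, and rewarded pulls strengthen the preference for that arm among later users.

```python
import random


def simulate(means, horizon, explore_rounds, incentive_rounds, bonus, seed=0):
    """Toy incentivized bandit with self-reinforcing preferences (illustrative only)."""
    rng = random.Random(seed)
    k = len(means)
    weights = [1.0] * k  # assumed initial preference weight for each arm
    pulls = [0] * k
    wins = [0] * k
    total_reward = 0
    total_payment = 0.0

    for t in range(horizon):
        # Users choose arms at random, in proportion to current preference weights.
        scores = list(weights)
        target = None
        if t < incentive_rounds:
            if t < explore_rounds:
                target = t % k  # round-robin exploration target
            else:
                # Incentivize the arm with the best empirical mean so far.
                target = max(
                    range(k),
                    key=lambda a: wins[a] / pulls[a] if pulls[a] else 0.0,
                )
            scores[target] += bonus  # assumed: the offer tilts the user's choice

        arm = rng.choices(range(k), weights=scores)[0]
        if arm == target:
            total_payment += bonus  # assumed: pay only when the offer is taken

        reward = 1 if rng.random() < means[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        total_reward += reward
        # Self-reinforcement: a rewarded pull attracts more users who prefer this arm.
        weights[arm] += reward

    return {"pulls": pulls, "total_reward": total_reward, "total_payment": total_payment}
```

Calling, for example, `simulate([0.3, 0.7], horizon=2000, explore_rounds=100, incentive_rounds=400, bonus=5.0)` returns the per-arm pull counts together with the total reward and total payment. Because incentives stop after `incentive_rounds`, the payment in this sketch is bounded by `bonus * incentive_rounds` regardless of the horizon, which is the kind of reward-versus-payment balance the abstract refers to.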
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Optimization and Search Problems
