Online Learning with Diverse User Preferences
Chao Gan, Jing Yang, Ruida Zhou, Cong Shen

TL;DR
This paper demonstrates that in a stochastic linear bandit setting with diverse user preferences, the regret can be reduced from logarithmic to constant by leveraging the diversity, with a proposed W-UCB algorithm achieving this under certain conditions.
Contribution
The paper introduces a novel analysis showing constant regret in linear bandits with diverse preferences and proposes the W-UCB algorithm to achieve this.
Findings
W-UCB achieves constant regret with diverse user preferences.
Diversity in user preferences accelerates convergence of arm estimates.
Performance validated with synthetic data.
Abstract
In this paper, we investigate the impact of diverse user preferences on learning under the stochastic multi-armed bandit (MAB) framework. We aim to show that when the user preferences are sufficiently diverse and each arm can be optimal for certain users, the O(log T) regret incurred by exploring the sub-optimal arms under the standard stochastic MAB setting can be reduced to a constant. Our intuition is that to achieve sub-linear regret, the number of times an optimal arm is pulled should scale linearly in time; when all arms are optimal for certain users and pulled frequently, the estimated arm statistics can quickly converge to their true values, thus reducing the need for exploration dramatically. We cast the problem into a stochastic linear bandit model, where both the user preferences and the state of arms are modeled as independent and identically distributed (i.i.d.)…
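The intuition in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's W-UCB algorithm. It is a purely greedy rule (no exploration bonus) under assumptions invented for illustration: three arms with hypothetical mean state vectors, one-hot user preference vectors drawn uniformly at random, Gaussian noise, and the pulled arm's full noisy state vector being observed. Because every arm is optimal for some user type, each arm is pulled a linear number of times, so its estimate sharpens without dedicated exploration.

```python
import numpy as np

def simulate(T=5000, seed=0):
    """Greedy play in a toy linear bandit with diverse users (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Hypothetical mean state vectors, one row per arm (not from the paper).
    arm_means = np.array([[0.9, 0.2, 0.1],
                          [0.1, 0.9, 0.2],
                          [0.2, 0.1, 0.9]])
    K, d = arm_means.shape
    # Assumed diverse user population: each one-hot preference favors a different arm.
    prefs = np.eye(d)
    est = np.zeros((K, d))          # running mean estimate of each arm's state
    pulls = np.zeros(K, dtype=int)  # number of times each arm was pulled
    regret = 0.0
    for t in range(T):
        w = prefs[rng.integers(d)]  # i.i.d. user preference for this round
        if t < K:
            a = t                   # pull each arm once to initialize estimates
        else:
            a = int(np.argmax(est @ w))  # greedy choice, no exploration bonus
        # Assumption: the pulled arm's noisy state vector is fully observed.
        state = arm_means[a] + 0.1 * rng.standard_normal(d)
        pulls[a] += 1
        est[a] += (state - est[a]) / pulls[a]  # incremental mean update
        # Expected regret of this round for the current user.
        regret += np.max(arm_means @ w) - arm_means[a] @ w
    return pulls, regret

if __name__ == "__main__":
    pulls, regret = simulate()
    print("pulls per arm:", pulls)
    print("cumulative regret:", regret)
```

In this toy setup the cumulative regret comes almost entirely from the few initialization rounds. If `prefs` is changed so that all users favor the same arm, the other arms are no longer pulled "for free", which is the regime where standard analyses incur O(log T) exploration cost.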
Taxonomy
Topics: Advanced Bandit Algorithms Research · Distributed Sensor Networks and Detection Algorithms · Data Stream Mining Techniques
