On Learning to Rank Long Sequences with Contextual Bandits
Anirban Santara, Claudio Gentile, Gaurav Aggarwal, Shuai Li

TL;DR
This paper introduces a new model for learning to rank long sequences using contextual bandits, providing theoretical guarantees and demonstrating improved empirical performance on real datasets.
Contribution
The paper proposes a variant of the cascading bandit model for flexible-length sequences with varying rewards and losses, formulates two generative models for it in the generalized linear setting, and gives upper confidence algorithms with tight regret bounds.
Findings
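Regret bounds that, when specialized to vanilla cascading bandits, are sharper than those previously available.
Significantly improved empirical performance over known cascading bandit baselines on a number of real-world datasets.
Support for flexible-length sequences with varying rewards and losses.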
Abstract
Motivated by problems of learning to rank long item sequences, we introduce a variant of the cascading bandit model that considers flexible-length sequences with varying rewards and losses. We formulate two generative models for this problem within the generalized linear setting, and design and analyze upper confidence algorithms for it. Our analysis delivers tight regret bounds which, when specialized to vanilla cascading bandits, result in sharper guarantees than previously available in the literature. We evaluate our algorithms on a number of real-world datasets, and show significantly improved empirical performance as compared to known cascading bandit baselines.
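For readers unfamiliar with the setting, the following is a minimal sketch of a generic upper confidence ranking step with cascade feedback. It is not the algorithm from the paper: it assumes a plain linear click model (not the generalized linear models the abstract mentions), a fixed list length `k`, and ridge-regression statistics `A` and `b`. The function names `rank_by_ucb` and `cascade_update` and the parameter `alpha` are hypothetical, chosen only for illustration.

```python
import numpy as np

def rank_by_ucb(features, A, b, alpha=1.0, k=3):
    """Return indices of the k items with the highest upper confidence bound.

    Assumption: a linear reward model with ridge statistics A (d x d) and b (d,).
    """
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                      # ridge-regression estimate
    mean = features @ theta                # estimated attractiveness per item
    # Confidence width sqrt(x^T A^{-1} x) for every item row x
    width = np.sqrt(np.einsum("ij,jk,ik->i", features, A_inv, features))
    return np.argsort(-(mean + alpha * width))[:k]

def cascade_update(features, ranked, click_pos, A, b):
    """Update A and b in place from cascade feedback.

    Items shown before the click count as negatives, the clicked item as a
    positive, and items after the click are unobserved. click_pos=None means
    the user examined the whole list without clicking.
    """
    last = click_pos if click_pos is not None else len(ranked) - 1
    for pos in range(last + 1):
        x = features[ranked[pos]]
        A += np.outer(x, x)
        b += x * (1.0 if pos == click_pos else 0.0)
    return A, b
```

In a simulation loop one would call `rank_by_ucb` to build the list, observe the first clicked position, and pass it to `cascade_update`; the paper's setting additionally lets the sequence length and the per-position rewards and losses vary, which this sketch does not model.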
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
