Unified Off-Policy Learning to Rank: a Reinforcement Learning Perspective
Zeyu Zhang, Yi Su, Hui Yuan, Yiran Wu, Rishab Balasubramanian, Qingyun Wu, Huazheng Wang, Mengdi Wang

TL;DR
This paper introduces CUOLR, a reinforcement learning approach that unifies off-policy learning to rank across various click models by modeling the ranking process as an MDP, enabling robust and model-agnostic learning.
Contribution
The paper proposes a novel MDP formulation for off-policy LTR that is agnostic to click models and applies offline RL techniques for improved robustness and versatility.
Findings
CUOLR outperforms existing algorithms on large-scale datasets.
It maintains robustness across different click models.
The method simplifies off-policy LTR without complex debiasing.
Abstract
Off-policy Learning to Rank (LTR) aims to optimize a ranker from data collected by a deployed logging policy. However, existing off-policy LTR methods often make strong assumptions about how users generate click data, i.e., the click model, and hence must be tailored specifically to each click model. In this paper, we unify the ranking process under general stochastic click models as a Markov Decision Process (MDP), so that the optimal ranking can be learned directly with offline reinforcement learning (RL). Building upon this, we leverage offline RL techniques for off-policy LTR and propose the Click Model-Agnostic Unified Off-policy Learning to Rank (CUOLR) method, which can be easily applied to a wide range of click models. Through a dedicated formulation of the MDP, we show that offline RL algorithms can adapt to various click models without…
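To make the MDP view concrete, here is a minimal sketch of how one logged ranking might be unrolled into offline-RL transitions. It is an illustration under stated assumptions, not the paper's actual formulation: the state fields (query, next position, documents already placed), the use of the observed click as the reward, and the names `RankState` and `ranking_to_transitions` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RankState:
    """Hypothetical state: which query, which slot comes next, what is already ranked."""
    query_id: str
    position: int              # next rank slot to fill (0-based)
    placed: Tuple[str, ...]    # documents already placed above this slot


# One transition is (state, action, reward, next_state, done).
Transition = Tuple[RankState, str, float, RankState, bool]


def ranking_to_transitions(
    query_id: str,
    ranked_docs: Sequence[str],
    clicks: Sequence[int],
) -> List[Transition]:
    """Unroll one logged ranking into a sequence of transitions.

    Assumption: the action at each step is the document placed at that
    position, and the reward is the logged click (1 or 0) on it.
    """
    if len(ranked_docs) != len(clicks):
        raise ValueError("ranked_docs and clicks must have the same length")

    transitions: List[Transition] = []
    state = RankState(query_id, 0, ())
    for k, (doc, click) in enumerate(zip(ranked_docs, clicks)):
        next_state = RankState(query_id, k + 1, state.placed + (doc,))
        done = k == len(ranked_docs) - 1  # episode ends after the last slot
        transitions.append((state, doc, float(click), next_state, done))
        state = next_state
    return transitions


# Example: a logged list of three documents where only the second was clicked.
log = ranking_to_transitions("q1", ["d3", "d1", "d2"], [0, 1, 0])
```

A dataset of such transitions, gathered from the logging policy, is the kind of input a generic offline RL algorithm consumes. In this sketch no click-model-specific code appears: whatever dependence clicks have on position or on earlier documents would have to be picked up from the state itself.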
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Economic and Environmental Valuation · Optimization and Search Problems
