Proximal Ranking Policy Optimization for Practical Safety in Counterfactual Learning to Rank
Shashank Gupta, Harrie Oosterhuis, and Maarten de Rijke

TL;DR
This paper introduces proximal ranking policy optimization (PRPO), a safe counterfactual learning-to-rank method that limits how much a learned ranking model can degrade performance relative to a safe ranking model, without relying on assumptions about user behavior.
Contribution
PRPO is the first method to provide unconditional safety in counterfactual learning to rank: its safety does not depend on assumptions about user behavior, which makes deployment safer.
Findings
PRPO outperforms existing safe inverse propensity scoring methods.
PRPO maintains safety even in adversarial scenarios.
PRPO achieves higher performance while ensuring safety.
Abstract
Counterfactual learning to rank (CLTR) can be risky and, in various circumstances, can produce sub-optimal models that hurt performance when deployed. Safe CLTR was introduced to mitigate these risks when using inverse propensity scoring to correct for position bias. However, the existing safety measure for CLTR is not applicable to state-of-the-art CLTR methods, cannot handle trust bias, and relies on specific assumptions about user behavior. We propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model. Thereby, PRPO imposes a limit on how much learned models can degrade performance metrics, without relying on any specific user assumptions. Our experiments show that PRPO provides higher performance…
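The abstract says PRPO removes the incentive to move too far from a safe ranking model, but it does not give the objective. One way to picture that idea is a PPO-style clipped objective. The sketch below is illustrative only: the function name, the per-document weights, the clipping rule, and the bounds eps_lo and eps_hi are all assumptions made for this example, not the authors' formulation.

```python
def clipped_utility(rewards, new_weights, safe_weights, eps_lo=0.8, eps_hi=1.2):
    """Hypothetical PPO-style clipped ranking objective (illustrative only).

    rewards      -- per-document reward estimates (may be negative)
    new_weights  -- per-document weight (e.g. expected exposure) under the learned policy
    safe_weights -- the same weights under the safe policy (must be non-zero)
    eps_lo/eps_hi -- assumed clipping bounds on the ratio new/safe
    """
    total = 0.0
    for r, w_new, w_safe in zip(rewards, new_weights, safe_weights):
        ratio = w_new / w_safe
        if r >= 0:
            # Promoting a rewarded document beyond eps_hi earns nothing extra.
            ratio = min(ratio, eps_hi)
        else:
            # Demoting a penalized document below eps_lo earns nothing extra.
            ratio = max(ratio, eps_lo)
        total += r * ratio
    return total


# Pushing a rewarded document's ratio from 2x to 5x does not change the objective.
same_gain = clipped_utility([1.0], [2.0], [1.0]) == clipped_utility([1.0], [5.0], [1.0])
```

Under these assumptions the objective is flat outside the interval [eps_lo, eps_hi], so an optimizer has no reason to keep moving a document's weight once the bound is reached. Because the clipping depends only on the two policies and not on a click model, a rule of this kind needs no user behavior assumptions.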
Taxonomy
Topics: Multi-Criteria Decision Making · Imbalanced Data Classification Techniques
