Practical and Robust Safety Guarantees for Advanced Counterfactual Learning to Rank
Shashank Gupta, Harrie Oosterhuis, and Maarten de Rijke

TL;DR
This paper introduces proximal ranking policy optimization (PRPO), a safe counterfactual learning to rank method that provides safety in deployment without relying on assumptions about user behavior. It outperforms existing safe approaches, especially in adversarial scenarios.
Contribution
The paper generalizes the existing safe CLTR approach to state-of-the-art doubly robust CLTR and to trust bias, and proposes PRPO, a novel safety approach that ensures unconditional safety during deployment without assumptions about user behavior.
Findings
PRPO maintains safety even in adversarial conditions.
The generalized safe CLTR improves performance over existing methods.
PRPO outperforms safe inverse propensity scoring in experiments.
Abstract
Counterfactual learning to rank (CLTR) can be risky and, in various circumstances, can produce sub-optimal models that hurt performance when deployed. Safe CLTR was introduced to mitigate these risks when using inverse propensity scoring to correct for position bias. However, the existing safety measure for CLTR is not applicable to state-of-the-art CLTR methods, cannot handle trust bias, and relies on specific assumptions about user behavior. Our contributions are two-fold. First, we generalize the existing safe CLTR approach to make it applicable to state-of-the-art doubly robust CLTR and trust bias. Second, we propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model. Thereby, PRPO imposes a limit…
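The idea of removing incentives to drift too far from a safe ranking model can be illustrated with a small sketch. This is not the authors' objective: it assumes a PPO-style clipping rule, which the name "proximal" suggests but the truncated abstract does not spell out. The function name, the per-document ratios (new policy weight divided by safe policy weight), the reward values, and the bounds eps_minus and eps_plus are all hypothetical names chosen for illustration.

```python
from typing import Sequence


def clipped_utility(
    ratios: Sequence[float],
    rewards: Sequence[float],
    eps_minus: float,
    eps_plus: float,
) -> float:
    """Sum of per-document rewards weighted by a clipped policy ratio.

    ratios  : hypothetical new-policy / safe-policy weight per document
    rewards : hypothetical estimated reward per document (may be negative)
    """
    total = 0.0
    for ratio, reward in zip(ratios, rewards):
        if reward >= 0:
            # Raising the ratio above eps_plus earns nothing extra.
            total += min(ratio, eps_plus) * reward
        else:
            # Lowering the ratio below eps_minus earns nothing extra.
            total += max(ratio, eps_minus) * reward
    return total


# Both ratio vectors lie outside the bounds, so they score the same:
# the objective gives no reason to move further from the safe model.
moderate = clipped_utility([2.0, 0.1], [1.0, -1.0], eps_minus=0.5, eps_plus=1.5)
extreme = clipped_utility([3.0, 0.0], [1.0, -1.0], eps_minus=0.5, eps_plus=1.5)
```

Under this assumed rule, the utility is flat once a ratio leaves the interval [eps_minus, eps_plus], so a gradient-based learner has no incentive to push the ranking behavior beyond those bounds, whatever the click data suggests.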
