Unbiased Learning to Rank: Counterfactual and Online Approaches
Harrie Oosterhuis, Rolf Jagerman, Maarten de Rijke

TL;DR
This paper provides a comprehensive overview and comparison of counterfactual and online unbiased learning to rank methods, highlighting their differences, advantages, and practical considerations for search systems.
Contribution
It offers an in-depth tutorial contrasting two main unbiased LTR methodologies, aiding practitioners in understanding and selecting appropriate approaches.
Findings
Counterfactual LTR learns from historical data with bias correction.
Online LTR uses randomized interactions to eliminate bias.
Both methods achieve unbiased ranking but differ in guarantees and user impact.
Tutorial Overview
Harrie Oosterhuis, University of Amsterdam, Amsterdam, The Netherlands
Rolf Jagerman, University of Amsterdam, Amsterdam, The Netherlands
Maarten de Rijke, University of Amsterdam, Amsterdam, The Netherlands
Abstract.
This tutorial covers and contrasts the two main methodologies in unbiased Learning to Rank (LTR): Counterfactual LTR and Online LTR. There has long been an interest in LTR from user interactions; however, this form of implicit feedback is very biased. In recent years, unbiased LTR methods have been introduced to remove the effect of different types of bias caused by user behavior in search. For instance, a well-addressed type of bias is position bias: the rank at which a document is displayed heavily affects the interactions it receives. Counterfactual LTR methods deal with such types of bias by learning from historical interactions while correcting for the effect of the explicitly modelled biases. Online LTR does not use an explicit user model; in contrast, it learns through an interactive process where randomized results are displayed to the user. Through randomization, the effect of different types of bias can be removed from the learning process. Though both methodologies lead to unbiased LTR, their approaches differ considerably, and so do their theoretical guarantees, empirical results, effects on the user experience during learning, and applicability. Consequently, for practitioners the choice between the two is highly consequential. By providing an overview of both approaches and contrasting them, we aim to provide an essential guide to unbiased LTR so as to aid in understanding and choosing between methodologies.
1. Introduction
LTR has long been a core task in Information Retrieval (IR), as ranking models form the basis of most search and recommendation systems. Traditionally, LTR has been approached as a supervised task where there is a dataset with perfect relevance annotations (Liu, 2009). However, over time the limitations of this approach have become apparent. Most importantly, datasets are very expensive to create (Chapelle and Chang, 2011) and user preferences do not necessarily align with the annotations (Sanderson, 2010). As a result, interest in LTR from user interactions has increased significantly in recent years.
User interactions, often in the form of user clicks, provide implicit feedback (Joachims, 2002), and while cheap to collect, they are also heavily biased (Yue et al., 2010; Wang et al., 2018). The most prominent form of bias in ranking is position bias: users pay more attention to higher-ranked documents, and consequently, the order in which documents are displayed considerably affects the interactions that take place (Wang et al., 2018). Another common form of bias is item selection bias: users can only interact with documents that are displayed, and as a result, the selection of displayed documents heavily affects which interactions are possible. Naively ignoring these biases during the learning process will result in biased ranking models that are not optimal for user preferences (Joachims et al., 2017). Thus, the field of LTR from user interactions is mainly focussed on methods that remove biases from the learning process, resulting in unbiased LTR.
The first approach to unbiased LTR is Counterfactual Learning to Rank (CLTR); it has its roots in user modeling (Chuklin et al., 2015). CLTR relies on a user model that models observance probabilities explicitly; this model can be inferred separately (Joachims et al., 2017; Agarwal et al., 2018; Carterette and Chandar, 2018) or jointly learned (Wang et al., 2016; Ai et al., 2018). By adjusting for observance probabilities, the effect of position bias can be removed from learning. This approach allows unbiased learning from historical data, i.e., interactions collected in the past, as long as an accurate user model can be inferred.
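To illustrate how adjusting for observance probabilities removes position bias, the following is a minimal Python sketch and not the implementation of any cited method. It assumes a simple position-based click model, in which a click requires that a document is both observed and judged relevant, and it assumes the observance probabilities are known exactly. The function names, relevance values, and propensities are hypothetical and chosen only for illustration.

```python
import random


def simulate_click_log(relevance, propensities, n_impressions, seed=0):
    """Simulate clicks on one fixed ranking under an assumed position-based model.

    The document at rank k is clicked only if it is observed
    (probability propensities[k]) and found relevant (probability relevance[k]).
    """
    rng = random.Random(seed)
    log = []
    for _ in range(n_impressions):
        clicks = [
            1 if (rng.random() < theta and rng.random() < rel) else 0
            for rel, theta in zip(relevance, propensities)
        ]
        log.append(clicks)
    return log


def naive_estimate(log):
    """Plain click-through rate per rank; this ignores position bias."""
    n = len(log)
    return [sum(clicks[k] for clicks in log) / n for k in range(len(log[0]))]


def ips_estimate(log, propensities):
    """Inverse-propensity-scored estimate: each click is weighted by 1 / propensity."""
    n = len(log)
    return [
        sum(clicks[k] for clicks in log) / (n * propensities[k])
        for k in range(len(propensities))
    ]


# Hypothetical setup: three equally relevant documents shown at ranks 1 to 3.
relevance = [0.6, 0.6, 0.6]
propensities = [1.0, 0.5, 0.25]  # assumed known observance probabilities

log = simulate_click_log(relevance, propensities, n_impressions=50_000)
naive = naive_estimate(log)
corrected = ips_estimate(log, propensities)

print("naive:", [round(x, 3) for x in naive])
print("IPS:  ", [round(x, 3) for x in corrected])
```

Because all three documents are equally relevant in this simulation, the naive click-through rates mainly reflect rank. Dividing by the observance probability should bring each estimate back close to the true relevance, up to sampling noise, provided the assumed propensities are accurate.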
The second approach is Online Learning to Rank (OLTR), which optimizes by directly interacting with users (Yue and Joachims, 2009). Repeatedly, an OLTR method presents a user with a ranking, observes their interactions, and updates its ranking model accordingly. Initially, these methods were based around interleaving methods (Joachims, 2003) that compare rankers unbiasedly from clicks. Dueling Bandit Gradient Descent (DBGD) compares its current ranking model with a slight variation at each step, and updates toward the variation if such a preference is inferred (Yue and Joachims, 2009). This approach is related to existing bandit methods for online learning to re-rank (Katariya et al., 2016; Kveton et al., 2015; Lagrée et al., 2016). In contrast with DBGD, these re-ranking approaches do not learn ranking models that can be applied to unseen documents and queries. While DBGD has long formed the basis of OLTR (Oosterhuis et al., 2016; Oosterhuis and de Rijke, 2017; Schuth et al., 2016; Hofmann et al., 2013a; Hofmann et al., 2013b; Zhao and King, 2016), fundamental problems with this approach were recently discovered (Oosterhuis and de Rijke, 2019). As a result, an alternative approach to OLTR was proposed: Pairwise Differentiable Gradient Descent (PDGD) (Oosterhuis and de Rijke, 2018). By not building on the Dueling Bandit approach, PDGD avoids the problems recognized with DBGD while also displaying considerable performance gains. Thus OLTR promises a responsive learning process where ranking systems adapt to users automatically and continuously.
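The DBGD update described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the comparison between the current model and its variation is delegated to a hypothetical callback `candidate_wins`, which in a real system would be an interleaved comparison based on user clicks. Here it is replaced by a noise-free toy oracle that prefers whichever weight vector is closer to a hidden target. The step sizes `delta` and `eta` are illustrative values.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dbgd_step(weights, candidate_wins, rng, delta=1.0, eta=0.01):
    """One DBGD-style step: probe a perturbed model, move toward it if it wins."""
    u = random_unit_vector(len(weights), rng)
    candidate = [w + delta * x for w, x in zip(weights, u)]
    if candidate_wins(weights, candidate):
        # Take a small step of size eta in the winning direction.
        return [w + eta * x for w, x in zip(weights, u)]
    return weights


# Hypothetical stand-in for an interleaved comparison: the candidate "wins"
# if it is closer to a hidden target weight vector.
target = [1.0, -2.0, 0.5]


def distance(w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, target)))


def closer_to_target(current, candidate):
    return distance(candidate) < distance(current)


rng = random.Random(0)
weights = [0.0, 0.0, 0.0]
for _ in range(2000):
    weights = dbgd_step(weights, closer_to_target, rng)

print("distance before:", round(distance([0.0, 0.0, 0.0]), 3))
print("distance after: ", round(distance(weights), 3))
```

With a noise-free oracle and `eta` no larger than `delta`, every accepted step moves the weights closer to the target. Real click feedback is noisy, so this monotone behaviour should not be expected in practice.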
We see that a large shift in unbiased LTR has taken place in the last three years: the emergence of CLTR from the field of user modelling and the replacement of the DBGD approach with PDGD in OLTR. It is very important that practitioners and academics have a good understanding of each approach, its advantages, and its limitations. Each approach has different theoretical properties and empirical findings show substantial performance differences depending on the circumstances. As a result, it is essential for LTR practitioners to understand the applicability and effectiveness of each method. As the field has recently advanced in these different directions, we argue this is the perfect time for a single tutorial to present the two approaches together to the IR community.
In this tutorial, we provide an overview of both CLTR and OLTR approaches and their underlying theory. We discuss the situations for which each approach has been designed, and the places where they are applicable. Furthermore, we compare the properties of both approaches and give guidance on how the decision between them should be made. For the field of IR we aim to provide an essential guide on unbiased LTR that aids in understanding and choosing between methodologies.
2. Objectives
The main objectives we wish to achieve with this tutorial are:
- Motivate the concept of unbiased LTR.
- Provide a complete overview of the two main approaches to unbiased LTR.
- Contrast the theoretical differences between the approaches and show the different fundamental assumptions they make.
- Give guidance on how a decision between the two approaches should be made: discuss their strengths and weaknesses and what conditions should be considered when deciding between them.
- Discuss future directions for unbiased LTR.
3. Relevance to the IR community
Many open questions remain to be addressed and there are many opportunities for the information retrieval community to benefit from and contribute to the area. Ever since the first publications on learning to rank (e.g., Fuhr and Buckley, 1991), the well-known information retrieval conferences, such as SIGIR, CIKM, ECIR, WSDM, and WWW, have seen follow-up work, as have related conferences, such as KDD, ICML, and NIPS. We estimate that in the last five years alone, hundreds of papers have been published on learning to rank.
As far as we are aware there has been no tutorial on unbiased LTR that brings the two angles (counterfactual and online) together, neither at SIGIR nor at any of the conferences listed above. There have been tutorials on counterfactual LTR, cf. (Ai et al., 2018; Joachims and Swaminathan, 2016), but they ignore online LTR. Similarly, existing tutorials on online LTR, cf. (Grotov and de Rijke, 2016; Oosterhuis, 2018) mostly ignore counterfactual LTR. Therefore, it appears this is the first tutorial to discuss and contrast both unbiased LTR methodologies comprehensively.
4. Format and Detailed Schedule
The tutorial will consist of two hours of lectures, split into two one-hour blocks by a break.
Introduction (10 min)
Brief introduction on the limitations of supervised learning to rank, and biases in user interactions, so that the audience understands the need for unbiased LTR.
- 5 min – Limitations of the supervised approach
Discuss the limitations of using annotated datasets (Liu, 2009), most importantly: they are expensive (Chapelle and Chang, 2011), they do not necessarily agree with users (Sanderson, 2010), and in some situations such a dataset cannot be constructed (Wang et al., 2016).
- 5 min – Learning from user interactions
User interactions provide an alluring alternative: by learning from their behavior the true preferences of users may be found (Radlinski et al., 2008; Joachims, 2002). However, user interactions contain noise and biases (Yue et al., 2010); for reliable LTR, position bias has to be countered. Similarly, in many places selection bias is unavoidable and has to be dealt with.
Counterfactual Learning to Rank (50 min)
The CLTR approach uses explicit user models to separately infer the probability that a document was observed. These observance probabilities can then be used to counter the effect of position bias.
- 15 min – Counterfactual evaluation
Discuss the offline evaluation of online metrics using Inverse Propensity Scoring (IPS). We present the proof that IPS produces an unbiased estimate. IPS is the tool that underlies all of the CLTR methods, and it is important for the audience to have a good grasp of it.
- 10 min – Propensity-weighted LTR
Describe in detail propensity-weighted LTR methods (Joachims et al., 2017; Wang et al., 2016; Bendersky et al., 2018). Discuss the assumptions made by these methods and walk through the algorithms step-by-step.
- 15 min – Estimating position bias
Discuss position bias estimation techniques (Wang et al., 2018), which are necessary to compute the propensity scores used in all IPS-based learning algorithms. We focus on both online estimation of position bias (Wang et al., 2018) and offline estimation of position bias (Agarwal et al., 2018). Additionally, we briefly look at trust-bias and how it can be addressed (Agarwal et al., 2019).
- 10 min – Practical considerations
Highlight some of the practical difficulties and their solutions, such as high variance (Swaminathan and Joachims, 2015).
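Two of the items above, position bias estimation and the handling of high variance, can be illustrated with a small Python sketch. It is a simplification under stated assumptions and not the procedure of any cited paper. The first function assumes that documents were shuffled uniformly across ranks while the clicks were logged, so that the average relevance is the same at every rank and the click-through rate at rank k is proportional to the observance probability at rank k. The second function applies propensity clipping, one commonly used remedy for high variance that trades a small bias for stability. The function names, the click counts, and the threshold `tau` are hypothetical.

```python
def relative_propensities(clicks_per_rank, impressions_per_rank):
    """Estimate observance probabilities relative to rank 1.

    Assumes the logged results were randomized across ranks, so that
    differences in click-through rate are caused by position alone.
    """
    ctr = [c / n for c, n in zip(clicks_per_rank, impressions_per_rank)]
    return [rate / ctr[0] for rate in ctr]


def clipped_ips_weights(propensities, tau=0.1):
    """Inverse propensity weights, with propensities clipped from below at tau.

    Clipping bounds the largest weight by 1 / tau, which limits variance
    at the cost of no longer being exactly unbiased.
    """
    return [1.0 / max(theta, tau) for theta in propensities]


# Hypothetical aggregate click counts collected under randomization.
clicks = [200, 100, 10]
impressions = [1000, 1000, 1000]

theta = relative_propensities(clicks, impressions)
weights = clipped_ips_weights(theta, tau=0.1)

print("relative propensities:", [round(t, 3) for t in theta])
print("clipped IPS weights:  ", [round(w, 3) for w in weights])
```

Without clipping, the lowest rank in this example would receive a weight of roughly 20, so a single click there could dominate the estimate. The threshold caps that influence.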
Online Learning to Rank (45 min)
OLTR methods learn by directly interacting with users; they deal with biases by adding stochasticity to the displayed results.
- 5 min – Online evaluation
Discuss interleaving and how it deals with position bias (Joachims, 2003; Hofmann et al., 2013c). Most of the initial OLTR methods rely on interleaving; it is important the audience understands this basis.
- 10 min – Dueling Bandit Gradient Descent
Describe DBGD: the original OLTR method (Yue and Joachims, 2009) which is based on interleaving. This method defined a decade of OLTR algorithms.
- 5 min – Extensions of DBGD and their limitations
Many extensions of DBGD have been proposed (Oosterhuis et al., 2016; Oosterhuis and de Rijke, 2017; Schuth et al., 2016; Hofmann et al., 2013a; Hofmann et al., 2013b; Zhao and King, 2016). We will briefly describe some approaches and show that they do not lead to long-term improvements in performance.
- 10 min – Regret bounds of DBGD and their problems
The regret bounds of DBGD guarantee that its performance should eventually approximate the optimal performance. However, empirically we do not observe this behavior (Schuth et al., 2016; Oosterhuis and de Rijke, 2018). Recent work has found that the regret bounds rely on assumptions which are impossible for ranking problems (Oosterhuis and de Rijke, 2019). Understanding these issues may be very valuable for future work searching for regret bounds for ranking problems.
- 10 min – Pairwise Differentiable Gradient Descent
Describe PDGD, the latest OLTR method, which does not rely on DBGD. It optimizes a probabilistic policy and deals with bias by adding some randomization to the displayed results. It has been proven to be unbiased w.r.t. position and selection bias (Oosterhuis and de Rijke, 2018).
- 10 min – Comparison of PDGD and DBGD
Discuss empirical comparisons between PDGD and DBGD which show PDGD outperforming DBGD in all experimental conditions (Oosterhuis and de Rijke, 2018, 2019). Compare PDGD and DBGD on a theoretical level to explain these differences.
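Since the online evaluation item above builds on interleaving, a small Python sketch of one well-known scheme, team-draft interleaving, may help make the idea concrete. The source does not state which interleaving method the tutorial presents, so this choice is an assumption. The function names and the toy rankings are hypothetical, and click handling is reduced to counting clicks per team.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng):
    """Build an interleaved list in which rankers A and B take turns picking.

    The ranker with fewer picks so far goes next; ties are broken by a coin
    flip. Each pick is that ranker's highest-ranked document not yet placed.
    Returns the interleaved list and the team credited for each position.
    """
    rankings = {"A": ranking_a, "B": ranking_b}
    count = {"A": 0, "B": 0}
    interleaved, teams = [], []
    while len(interleaved) < length:
        remaining = {
            team: [d for d in ranking if d not in interleaved]
            for team, ranking in rankings.items()
        }
        if not remaining["A"] and not remaining["B"]:
            break  # both rankers have run out of documents
        if count["A"] < count["B"] or (count["A"] == count["B"] and rng.random() < 0.5):
            team = "A"
        else:
            team = "B"
        if not remaining[team]:
            team = "B" if team == "A" else "A"
        interleaved.append(remaining[team][0])
        teams.append(team)
        count[team] += 1
    return interleaved, teams


def infer_preference(teams, clicked_positions):
    """Credit each click to the team that contributed the clicked document."""
    a = sum(1 for i in clicked_positions if teams[i] == "A")
    b = sum(1 for i in clicked_positions if teams[i] == "B")
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "tie"


rng = random.Random(1)
interleaved, teams = team_draft_interleave([1, 2, 3, 4], [4, 3, 2, 1], 4, rng)
print(interleaved, teams)
print(infer_preference(teams, clicked_positions=[0]))
```

Because the coin flip decides which ranker contributes to which position, neither ranker is systematically favoured by position bias when preferences are aggregated over many impressions.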
Conclusion (15 min)
Conclude the tutorial by summarizing the previous sections and fully comparing and contrasting the two methodologies.
- 10 min – Summarize the two methodologies and their differences
Reflect on the two approaches to unbiased LTR, contrast their properties and applicability. Consider differences in theoretical properties and empirically observed performance (Jagerman et al., 2019). Recognize in which situations each method is more suited.
- 5 min – Future directions for unbiased learning to rank
We draw a picture of what current LTR methods can do for current applications. Then we identify problems with the current approach and speculate on what potential solutions may look like. We finish by describing the promising directions that future LTR work could investigate.
5. Supplied Material
The slides will be made available to the public (SIGIR’19 slides will be published on: http://ltr-tutorial-sigir19.isti.cnr.it/), and we will include references to open source code from related work.
Acknowledgements.
This research was partially supported by Ahold Delhaize, the Association of Universities in the Netherlands (VSNU), the Innovation Center for Artificial Intelligence (ICAI), and the Netherlands Organisation for Scientific Research (NWO) under project nr 612.001.551. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.
References
- Agarwal et al. (2019) Aman Agarwal, Xuanhui Wang, Cheng Li, Michael Bendersky, and Marc Najork. 2019. Addressing Trust Bias for Unbiased Learning-to-Rank. In The World Wide Web Conference. ACM, 4–14.
- Agarwal et al. (2018) Aman Agarwal, Ivan Zaitsev, and Thorsten Joachims. 2018. Consistent position bias estimation without online interventions for learning-to-rank. arXiv preprint arXiv:1806.03555 (2018).
- Ai et al. (2018) Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W Bruce Croft. 2018. Unbiased learning to rank with unbiased propensity estimation. arXiv preprint arXiv:1804.05938 (2018).
- Bendersky et al. (2018) Mike Bendersky, Xuanhui Wang, Marc Najork, and Don Metzler. 2018. Learning with sparse and biased feedback for personal search. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI). 5219–5223.
- Carterette and Chandar (2018) Ben Carterette and Praveen Chandar. 2018. Offline comparative evaluation with incremental, minimally-invasive online feedback. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 705–714.
- Chapelle and Chang (2011) Olivier Chapelle and Yi Chang. 2011. Yahoo! Learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge. 1–24.
- Chuklin et al. (2015) Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. 2015. Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services 7, 3 (2015), 1–115.
