Pretraining De-Biased Language Model with Large-scale Click Logs for Document Ranking
Xiangsheng Li, Xiaoshu Chen, Kunliang Wei, Bin Hu, Lei Jiang, Zeqian Huang, Zhanhui Kang

TL;DR
This paper introduces a method to pretrain a debiased language model for document ranking using large-scale click logs, improving retrieval performance and winning first place in a competitive benchmark.
Contribution
The paper proposes a novel pretraining approach leveraging user click logs and behavior features to reduce bias in language models for document ranking.
Findings
Validated effectiveness on Baidu click logs
Achieved top performance in WSDM Cup 2023
Outperformed models trained on simulated data
Abstract
Pre-trained language models have achieved great success in various large-scale information retrieval tasks. However, most pretraining tasks are based on counterfeit retrieval data, where a query produced by a tailored rule is assumed to be the query a user would issue for the given document or passage. We therefore explore using large-scale click logs to pretrain a language model instead of relying on simulated queries. Specifically, we propose to use user behavior features to pretrain a debiased language model for document ranking. Extensive experiments on desensitized Baidu click logs validate the effectiveness of our method. Our team won 1st place in the WSDM Cup 2023 Pre-training for Web Search task with a Discounted Cumulative Gain @ 10 (DCG@10) score of 12.16525 on the final leaderboard.
Table 1. Learning to rank features.

| ID | Feature |
|---|---|
| 1 | query length |
| 2 | document length |
| 3 | query frequency |
| 4 | number of hit words of query in document |
| 5 | BM25 score |
| 6 | TF-IDF score |

Table 2. BERT-based rerankers under different pre-training and finetuning settings.

| Model ID | Method | Backbone | Pretrain step | Finetune step | Submission DCG@10 |
|---|---|---|---|---|---|
| 1 | CTR pre-training | BERT-24 | 1700K | 5130 | 11.96214 |
| 2 | CTR pre-training | BERT-24 | 1700K | 4180 | unk |
| 3 | CTR pre-training | BERT-12 | 2150K | 5130 | 11.32363 |
| 4 | CTR pre-training | BERT-24 | 590K | 5130 | 11.94845 |
| 5∗ | CTR pre-training | BERT-24 | 1700K | 4180 | unk |
| 6 | Debiased CTR pre-training | BERT-24 | 1940K | 5130 | unk |
Pretraining De-Biased Language Model with Large-scale Click Logs for Document Ranking
Xiangsheng Li, Xiaoshu Chen, Kunliang Wei, Bin Hu, Lei Jiang, Zeqian Huang and Zhanhui Kang
Tencent Machine Learning Platform Search,
Shenzhen, Guangdong, China
(2023)
Keywords: Neural IR; Pretrained language model; Document ranking
Venue: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (WSDM '23), February 27–March 3, 2023, Singapore.
CCS concepts: Information systems → Retrieval models and ranking
1. Introduction
Recent advances have shown that pre-trained language models (PTMs) such as BERT (Devlin et al., 2018), T5 (Raffel et al., 2020), and GPT (Radford et al., 2018) can capture rich semantic information from text and achieve state-of-the-art performance on various information retrieval tasks (Qiao et al., 2019; Padaki et al., 2020; Li et al., 2022). However, the pretraining objectives of most PTMs are based only on classical NLP targets (Devlin et al., 2018), e.g., Masked Language Modeling and Next Sentence Prediction, and have not been carefully designed to adapt to downstream IR tasks. To address this problem, different pre-training methods with tailored IR objectives have been proposed to obtain better pre-trained language models for downstream IR finetuning. Ma et al. (2021b) proposed the representative words prediction (ROP) task, which assumes that a sampled word set with a higher query likelihood is more “representative” of the document. In addition, the dependencies between the inner structures of Wikipedia pages have been exploited to design pretraining tasks for IR (Ma et al., 2021a; Wu et al., 2022), achieving remarkable retrieval performance compared to traditional pre-trained language models. These experimental results strongly suggest that traditional PTMs are usually data-hungry on IR tasks, and that pre-training with suitable IR tasks can effectively boost performance even in few-shot or zero-shot scenarios (Chen et al., 2022).
Despite the success of various IR-based pretraining objectives for PTMs, we observe that these objectives are mostly designed around manually tailored counterfeit retrieval data, where the produced query is assumed to be the query a user would issue for the given document or passage. As large-scale click logs can be obtained at little cost, we argue that click logs can also be a good resource for pretraining IR-based PTMs. Since the queries all come from real users and are closer to the query distribution of downstream IR tasks, designing IR-based pretraining objectives on click logs has great potential to improve downstream IR tasks.
In this work, we explore using large-scale click logs to pretrain a language model with IR-based objectives. Specifically, we design a CTR prediction task and a debiased CTR prediction task as our IR-based pretraining objectives. Furthermore, we extract additional sparse features (e.g., BM25, document length, query frequency) and feed them into an ensemble learning model to rerank the candidate list. Our team won 1st place in the WSDM Cup 2023 Pre-training for Web Search task with a Discounted Cumulative Gain @ 10 (DCG@10) score of 12.16525 on the final leaderboard (https://aistudio.baidu.com/aistudio/competition/detail/536/0/leaderboard).
2. Methodology
In this section, we present the pipeline that obtained the best ranking performance on the final leaderboard. It consists of five steps: 1) Pre-training with CTR prediction loss; 2) Debiased pre-training with user behavior features; 3) Finetuning with a pairwise margin ranking loss; 4) Extracting learning to rank features for each query-document pair; 5) Ensemble learning with all features. We use BERT (Devlin et al., 2018) as our backbone and feed the concatenation of query, title, and content into a BERT-based reranker to predict the relevance score.
2.1. Pre-training with CTR prediction loss
Click logs contain rich user behavior information, which provides potentially relevant query-document pairs. These pairs can be used to build IR-based pre-training objectives. We do not use the officially provided pointwise CTR prediction loss, as we found that it magnifies click bias and leads to weak ranking performance. Instead, we use a groupwise CTR prediction loss in which the relevance of a clicked document is expected to be higher than that of the non-clicked documents. The loss is designed as follows:
[Equation 1: groupwise CTR prediction loss]
Here, each group has a fixed size and contains exactly one clicked document; all other documents in the group are non-clicked. Because not all documents in the candidate list are used during pre-training, this loss reduces click bias and yields better ranking performance than the pointwise CTR prediction loss. In addition, we use a masked language modeling (MLM) loss as a pre-training objective.
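The exact form of the groupwise loss is not recoverable from this text. As a minimal sketch, assuming it is a softmax cross-entropy over the scores of one group (a common choice, but an assumption here), the loss for a single group could look as follows. The function name `groupwise_ctr_loss` and the example scores are hypothetical.

```python
import math

def groupwise_ctr_loss(scores, clicked_index):
    """Softmax cross-entropy over one group of relevance scores.

    scores: model scores, one per document in the group (assumed form).
    clicked_index: position of the single clicked document in the group.
    """
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    # Negative log-probability that the softmax assigns to the clicked document.
    return log_norm - scores[clicked_index]

# Hypothetical group of four documents; the first one was clicked.
well_ranked = groupwise_ctr_loss([3.0, 0.5, 0.1, -1.0], clicked_index=0)
badly_ranked = groupwise_ctr_loss([-1.0, 0.5, 0.1, 3.0], clicked_index=0)
```

Under this assumption the loss is small when the clicked document already has the highest score in its group and grows as non-clicked documents outscore it.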
[Equation 2: masked language modeling loss]
The loss is computed over the masked tokens of a given sentence, conditioned on the remaining tokens and the parameters of the language model. In particular, we use a whole word masking strategy based on the provided unigram_dict rather than single word masking. Whole word masking has been shown to perform better on various Chinese NLP tasks (Cui et al., 2021).
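To illustrate the idea of whole word masking, the sketch below segments a character sequence into dictionary words and then masks entire words until roughly the target ratio is reached. It is only an illustration: the format of `unigram_dict` is not described in this text, so the `vocabulary` set, the greedy forward maximum matching used for segmentation, and the function name `whole_word_mask` are all assumptions.

```python
import random

def whole_word_mask(chars, vocabulary, mask_ratio=0.15, max_word_len=4, seed=0):
    """Mask whole words (not single characters) in a character sequence."""
    rng = random.Random(seed)
    # Greedy forward maximum matching: split the characters into dictionary
    # words, falling back to single characters when nothing matches.
    words, i = [], 0
    while i < len(chars):
        for length in range(min(max_word_len, len(chars) - i), 0, -1):
            candidate = "".join(chars[i:i + length])
            if length == 1 or candidate in vocabulary:
                words.append((i, i + length))
                i += length
                break
    # Visit words in random order and mask each one completely until about
    # mask_ratio of the characters have been masked.
    budget = max(1, round(mask_ratio * len(chars)))
    masked = list(chars)
    order = list(range(len(words)))
    rng.shuffle(order)
    n_masked = 0
    for w in order:
        if n_masked >= budget:
            break
        start, end = words[w]
        for pos in range(start, end):
            masked[pos] = "[MASK]"
        n_masked += end - start
    return masked, words

# Hypothetical example sentence and dictionary.
chars = list("自然语言处理很有趣")
vocabulary = {"自然", "语言", "处理", "有趣"}
masked, words = whole_word_mask(chars, vocabulary)
```

Each word span returned in `words` is either fully masked or left untouched, which is the property that distinguishes this strategy from masking positions independently.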
The final pre-training objective is constructed as follows:
[Equation 3: final pre-training objective, combining the groupwise CTR prediction loss and the MLM loss]
2.2. Debiased Pre-training with user behavior features
To further reduce the impact of click bias during pre-training, we exploit additional behavior features when building the IR-based pre-training objectives. Specifically, we use dwell time to filter the pre-training groups from Section 2.1: the dwell time of the clicked document must be longer than that of the non-clicked documents by a given threshold. The groupwise CTR prediction loss is
[Equation 4: groupwise CTR prediction loss over dwell-time-filtered groups]
which is defined in terms of the dwell time of each document. In this way, the training samples are of higher quality and the clicked document is more likely to be a true positive sample. Note that this task is initialized from the checkpoint trained with Equation 3.
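The dwell-time filter can be sketched as a simple predicate over groups. This is one reading of the description, assuming the clicked document's dwell time must exceed the longest dwell time among the non-clicked documents by more than the threshold. The dictionary keys `clicked` and `dwell_time` and the function name `keep_group` are hypothetical; the 8-second default follows the value reported in the experimental settings.

```python
def keep_group(group, threshold=8.0):
    """Return True if the clicked document's dwell time exceeds every
    non-clicked document's dwell time by more than `threshold` seconds."""
    clicked = [d for d in group if d["clicked"]]
    others = [d for d in group if not d["clicked"]]
    if len(clicked) != 1:
        # The groupwise loss expects exactly one clicked document per group.
        return False
    longest_other = max((d["dwell_time"] for d in others), default=0.0)
    return clicked[0]["dwell_time"] - longest_other > threshold

# Hypothetical groups: the first has a confident click, the second does not.
groups = [
    [{"clicked": True, "dwell_time": 35.0},
     {"clicked": False, "dwell_time": 0.0},
     {"clicked": False, "dwell_time": 0.0}],
    [{"clicked": True, "dwell_time": 3.0},
     {"clicked": False, "dwell_time": 0.0}],
]
filtered = [g for g in groups if keep_group(g)]
```

Only the filtered groups would then be passed to the groupwise CTR loss, so short accidental clicks no longer serve as positive samples.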
2.3. Finetuning with margin ranking loss
After pre-training the language model with IR-based objectives, we finetune our model on a manually annotated dataset in which each candidate document is labeled with one of five relevance levels. We employ a margin ranking loss to finetune our model.
[Equation 5: margin ranking loss]
The positive document is sampled from the documents with relevance greater than or equal to 2, and the negative document is sampled from the documents with relevance lower than that of the positive document. The margin is set to 1 in our work.
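A minimal sketch of the pair sampling and the loss, assuming the standard hinge form of the margin ranking loss and relevance grades encoded as integers 0 to 4 (the encoding is an assumption; the text only states five levels). The function names are hypothetical.

```python
import random

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: zero once the positive outscores the negative by the margin."""
    return max(0.0, margin - (pos_score - neg_score))

def sample_pair(labels, rng):
    """Pick (positive_index, negative_index) for one query, or None.

    labels: relevance grade per candidate document (assumed to be 0-4).
    """
    # A positive needs relevance >= 2 and at least one lower-graded document.
    positives = [i for i, r in enumerate(labels)
                 if r >= 2 and any(other < r for other in labels)]
    if not positives:
        return None
    pos = rng.choice(positives)
    negatives = [i for i, r in enumerate(labels) if r < labels[pos]]
    return pos, rng.choice(negatives)

# Hypothetical relevance labels for the candidates of one query.
labels = [0, 3, 2, 1, 4]
pair = sample_pair(labels, random.Random(0))
```

Note that under this rule a document with relevance 2 can serve as the negative for a document with relevance 3 or 4, so the loss also separates the higher grades from each other.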
2.4. Learning to rank features
We also extract learning to rank features from the query-document pairs, as shown in Table 1. Specifically, we treat the title and content together as the whole document, so all document features are computed over both title and content.
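The six features in Table 1 can be computed along the following lines. This is a sketch under stated assumptions: the BM25 parameters (`k1=1.2`, `b=0.75`), the IDF variants, and the input layout (token lists, a corpus of all documents, a counter of query strings) are common defaults and are not taken from the paper.

```python
import math
from collections import Counter

def ltr_features(query, doc, corpus, query_counts, k1=1.2, b=0.75):
    """query, doc: token lists; corpus: list of token lists containing doc;
    query_counts: Counter mapping a query string to its frequency in the log."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in corpus for t in set(d))
    tf = Counter(doc)
    bm25 = tfidf = 0.0
    for term in set(query):
        if tf[term] == 0:
            continue  # terms absent from the document contribute nothing
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        bm25 += idf * tf[term] * (k1 + 1) / norm
        tfidf += tf[term] / len(doc) * math.log(n_docs / max(df[term], 1))
    return {
        "query_length": len(query),
        "document_length": len(doc),
        "query_frequency": query_counts[" ".join(query)],
        "hit_words": sum(1 for t in set(query) if t in tf),
        "bm25": bm25,
        "tfidf": tfidf,
    }

# Hypothetical toy corpus; each document stands for title plus content.
corpus = [["cheap", "flights", "to", "tokyo"],
          ["tokyo", "travel", "guide"],
          ["python", "tutorial"]]
query_counts = Counter({"tokyo flights": 3})
relevant = ltr_features(["tokyo", "flights"], corpus[0], corpus, query_counts)
unrelated = ltr_features(["tokyo", "flights"], corpus[2], corpus, query_counts)
```

Each returned dictionary corresponds to one row of features for a query-document pair and could be concatenated with the reranker scores before ensembling.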
2.5. Ensemble Learning
Given the learning to rank features and the predicted scores of the BERT-based rerankers, we feed them into LambdaMart to combine the strengths of the different models. LambdaMart is a state-of-the-art supervised ranker that won the Yahoo! Learning to Rank Challenge (2010) (Wu et al., 2010). We aggregate the six learning to rank features in Table 1 together with the predicted scores from the BERT-based rerankers in Table 2. We first use cross validation to determine the parameters of LambdaMart and then train LambdaMart on the whole validation set. The detailed procedure is as follows:
(1) Cross validation: We use five-fold cross validation to determine the parameters of LambdaMart. We also select models based on cross validation and finally exclude Model 5, since it does not improve ranking performance.
(2) Train and inference: With the determined parameters of LambdaMart and the selected rerankers, we train LambdaMart on the whole validation set and then calculate the relevance scores on the test set.
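The cross-validation step above can be sketched as a five-fold split that keeps all documents of a query in the same fold. Splitting by query is an assumption (the text does not say how folds were formed), and the function name `query_folds` is hypothetical. Training the ranker itself is omitted; one possible implementation, not confirmed by the source, is a gradient-boosting library with a LambdaMART-style objective such as LightGBM's `lambdarank`.

```python
def query_folds(query_ids, n_folds=5):
    """Split row indices into folds so that each query stays in one fold.

    query_ids: one query identifier per row (query-document pair).
    Returns a list of (train_indices, valid_indices) tuples.
    """
    unique = sorted(set(query_ids))
    # Round-robin assignment of queries to folds.
    fold_of = {q: i % n_folds for i, q in enumerate(unique)}
    folds = []
    for k in range(n_folds):
        train = [i for i, q in enumerate(query_ids) if fold_of[q] != k]
        valid = [i for i, q in enumerate(query_ids) if fold_of[q] == k]
        folds.append((train, valid))
    return folds

# Hypothetical validation set: 10 queries with 2 candidate documents each.
query_ids = [q for q in range(10) for _ in range(2)]
folds = query_folds(query_ids)
```

For each candidate parameter setting, the ranker would be trained on every `train` split and scored with DCG@10 on the matching `valid` split; the setting with the best average is then used to retrain on the whole validation set.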
3. Experiment
3.1. Experimental settings
We select BERT-base (12 layers) and BERT-large (24 layers) as our backbones and use a linear layer to predict the relevance score. The masking ratio in Equation 2 is set to 0.15. The dwell-time threshold in Equation 4 is set to 8 seconds. The parameters of LambdaMart after cross validation are: 300 training epochs, 100 leaves, and a learning rate of 0.05. We released our code, implemented in PyTorch and PaddlePaddle, at https://github.com/lixsh6/Tencent_wsdm_cup2023.
3.2. Results
We list experimental results under different settings in Table 2, where "unk" denotes that the model was not submitted to the leaderboard. We observe that BERT-large (24 layers) achieves better ranking performance than BERT-base (12 layers). In particular, debiased CTR pre-training (Model 6) achieves better performance than CTR pre-training in our cross validation experiments, as shown in the next subsection. Since we did not submit Model 6 to the leaderboard, we analyze it by visualizing the feature importance of LambdaMart. Finally, we feed the six learning to rank features in Table 1 and five BERT-based prediction scores in Table 2 (excluding Model 5) into LambdaMart, which achieves a DCG@10 of 12.16525 on the leaderboard.
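For reference, DCG@10 can be computed as in the sketch below. It assumes the widely used exponential gain and logarithmic discount; the exact variant used by the competition is not stated in this paper, so treat the formula as an assumption.

```python
import math

def dcg_at_k(relevances, k=10):
    """DCG over the top-k results, given relevance grades in ranked order.

    Assumes gain 2^rel - 1 and discount log2(rank + 1) with 1-based ranks.
    """
    return sum((2 ** rel - 1) / math.log2(position + 2)
               for position, rel in enumerate(relevances[:k]))

# Hypothetical ranked lists for one query (grades listed from rank 1 down).
good_order = dcg_at_k([3, 2, 0])
bad_order = dcg_at_k([0, 2, 3])
```

The leaderboard score would then be the mean of this value over all test queries, with each query's documents sorted by the ensemble's predicted relevance.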
3.3. Feature importance
We visualize the feature importance of LambdaMart in Figure 1. The prediction scores of the BERT-based rerankers account for most of the weight in LambdaMart, which suggests that a pretrained language model is a more important reranker than traditional IR methods (e.g., BM25, TF-IDF). In addition, Model 6 plays the most important role in LambdaMart, with about twice the weight of the second-ranked feature. This indicates that debiased CTR pre-training can effectively boost ranking performance compared to conventional CTR pre-training.
4. Conclusion
In this paper, we introduce our method for the WSDM Cup 2023 Pre-training for Web Search task, which won 1st place with a Discounted Cumulative Gain @ 10 (DCG@10) score of 12.16525 on the final leaderboard. We draw the following conclusions:
(1) Pre-training with a groupwise CTR prediction loss leads to better ranking performance in the downstream task than a pointwise CTR prediction loss. This is due to the high click bias incurred when modeling the full document list.
(2) Whole word masking can effectively boost ranking performance.
(3) Debiased pre-training with user behavior features can effectively reduce the click bias in the click logs, leading to a better pretrained language model.
(4) A BERT-large reranker achieves better ranking performance than a BERT-base reranker.
We also tried other popular pre-training strategies, such as retrieval-oriented pretraining with decoders (Liu and Shao, 2022) and T5 as a reranker (Zhuang et al., 2022), but did not find them effective in this competition task. We believe click logs can be a valuable resource for pretraining an IR-based language model and look forward to studying this area further in future work.
References
- Chen et al. (2022) Jia Chen, Yiqun Liu, Yan Fang, Jiaxin Mao, Hui Fang, Shenghao Yang, Xiaohui Xie, Min Zhang, and Shaoping Ma. 2022. Axiomatically Regularized Pre-training for Ad hoc Search. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1524–1534.
- Cui et al. (2021) Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, and Ziqing Yang. 2021. Pre-training with Whole Word Masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3504–3514.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
- Li et al. (2022) Xiangsheng Li, Jiaxin Mao, Weizhi Ma, Zhijing Wu, Yiqun Liu, Min Zhang, Shaoping Ma, Zhaowei Wang, and Xiuqiang He. 2022. A Cooperative Neural Information Retrieval Pipeline with Knowledge Enhanced Automatic Query Reformulation. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 553–561.
- Liu and Shao (2022) Zheng Liu and Yingxia Shao. 2022. RetroMAE: Pre-training Retrieval-oriented Transformers via Masked Auto-Encoder. arXiv preprint arXiv:2205.12035 (2022).
- Ma et al. (2021b) Xinyu Ma, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Xiang Ji, and Xueqi Cheng. 2021b. PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 283–291.
- Ma et al. (2021a) Zhengyi Ma, Zhicheng Dou, Wei Xu, Xinyu Zhang, Hao Jiang, Zhao Cao, and Ji-Rong Wen. 2021a. Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 1212–1221.
