# Pretraining De-Biased Language Model with Large-scale Click Logs for Document Ranking

**Authors:** Xiangsheng Li, Xiaoshu Chen, Kunliang Wei, Bin Hu, Lei Jiang, Zeqian Huang, Zhanhui Kang

arXiv: 2302.13498 · 2023-02-28

## TL;DR

This paper pretrains a debiased language model for document ranking on large-scale click logs instead of simulated queries. The approach won first place in the WSDM Cup 2023 Pre-training for Web Search track.

## Contribution

The paper proposes a novel pretraining approach leveraging user click logs and behavior features to reduce bias in language models for document ranking.

## Key findings

- Validated effectiveness on desensitized Baidu click logs
- Won 1st place in the WSDM Cup 2023 Pre-training for Web Search track with a DCG@10 of 12.16525 on the final leaderboard
- Outperformed models trained on simulated data

## Abstract

Pre-trained language models have achieved great success in various large-scale information retrieval tasks. However, most pretraining tasks are based on counterfeit retrieval data, where a query produced by a tailored rule is assumed to be the query the user issued for the given document or passage. Therefore, we explore using large-scale click logs to pretrain a language model instead of relying on simulated queries. Specifically, we propose to use user behavior features to pretrain a debiased language model for document ranking. Extensive experiments on desensitized Baidu click logs validate the effectiveness of our method. Our team won 1st place in the WSDM Cup 2023 Pre-training for Web Search track with a Discounted Cumulative Gain @ 10 (DCG@10) score of 12.16525 on the final leaderboard.
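For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how DCG@10 is commonly computed for a single query. The function name `dcg_at_k` and the `exponential_gain` switch are illustrative assumptions, not taken from the paper, and the abstract does not say which gain variant the competition used. A leaderboard score would typically be the average of this value over all evaluation queries.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10,
             exponential_gain: bool = False) -> float:
    """Discounted Cumulative Gain over the top-k results of one query.

    `relevances` holds the graded relevance labels of the returned
    documents, in ranked order (best-ranked first).

    Assumption: the gain variant is not stated in the abstract, so both
    common forms are offered here:
      linear gain:      rel / log2(rank + 1)
      exponential gain: (2**rel - 1) / log2(rank + 1)
    """
    total = 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        gain = (2.0 ** rel - 1.0) if exponential_gain else float(rel)
        total += gain / math.log2(rank + 1)
    return total


def mean_dcg_at_k(per_query_relevances: Sequence[Sequence[float]],
                  k: int = 10, exponential_gain: bool = False) -> float:
    """Average DCG@k across queries (hypothetical helper for illustration)."""
    if not per_query_relevances:
        return 0.0
    scores = [dcg_at_k(r, k, exponential_gain) for r in per_query_relevances]
    return sum(scores) / len(scores)
```

Because DCG is not normalized, its scale depends on the label range and the number of relevant documents per query, so a value such as 12.16525 is only meaningful when compared against other systems on the same test set.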

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.13498/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/2302.13498/full.md

## References

15 references — full list in the complete paper: https://tomesphere.com/paper/2302.13498/full.md

---
Source: https://tomesphere.com/paper/2302.13498