Pre-training Tasks for Embedding-based Large-scale Retrieval
Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar

TL;DR
This paper investigates pre-training tasks for embedding-based large-scale retrieval and shows that, with well-designed paragraph-level pre-training, Transformer embedding models significantly outperform traditional methods such as BM-25.
Contribution
It introduces and evaluates specific paragraph-level pre-training tasks that improve the effectiveness of Transformer-based embedding models for large-scale retrieval.
Findings
Paragraph-level pre-training tasks such as the Inverse Cloze Task (ICT), Body First Selection (BFS), and Wiki Link Prediction (WLP) improve retrieval accuracy.
Embedding-based Transformer models outperform BM-25 in retrieval tasks.
Proper pre-training is crucial for embedding models to excel in large-scale retrieval.
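As a rough illustration of how a paragraph-level pre-training task can produce training pairs, the sketch below builds one (query, document) pair in the style of the Inverse Cloze Task (ICT): one sentence of a passage serves as the pseudo-query and the remaining sentences as the pseudo-document. The function name `ict_pair`, the pre-split sentence list, and the seeded random generator are assumptions made for this example, not the authors' code.

```python
import random
from typing import List, Tuple


def ict_pair(sentences: List[str], rng: random.Random) -> Tuple[str, str]:
    """Build one ICT-style (query, document) pair from a passage.

    Hypothetical helper: `sentences` is a passage already split into
    sentences; sentence splitting itself is out of scope here.
    """
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to form a pair")
    # Pick one sentence at random to act as the pseudo-query.
    idx = rng.randrange(len(sentences))
    query = sentences[idx]
    # The rest of the passage, in original order, is the pseudo-document.
    document = " ".join(s for i, s in enumerate(sentences) if i != idx)
    return query, document


# Usage with a toy passage (illustrative text only).
passage = [
    "BM-25 ranks documents by token overlap.",
    "Dense retrievers embed queries and documents.",
    "Relevance is scored with an inner product.",
]
query, document = ict_pair(passage, random.Random(0))
```

Pairs built this way need no human labels, which is what makes such tasks usable for pre-training at scale.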
Abstract
We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then re-ranks the documents. Critically, the retrieval algorithm must not only achieve high recall but also be highly efficient, returning candidates in time sublinear in the number of documents. While the scoring phase has seen significant advances recently thanks to BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied. Most previous work relies on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse…
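The two-phase pipeline described in the abstract can be sketched as follows, assuming queries and documents have already been embedded as dense vectors. The function names `retrieve` and `rerank`, the toy vectors, and the lookup-table scorer are hypothetical stand-ins. The brute-force scan here is linear in the number of documents, so a real system would need an approximate inner-product search index to meet the sublinear requirement.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Vector, doc_vecs: List[Vector], k: int) -> List[int]:
    """Retrieval phase: ids of the k documents with the highest inner product.

    Brute-force for clarity (linear time); not a sublinear index.
    """
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def rerank(candidates: List[int], score_fn: Callable[[int], float]) -> List[int]:
    """Scoring phase: reorder candidates with a (typically costlier) scorer."""
    return sorted(candidates, key=score_fn, reverse=True)


# Toy data: four 2-d document embeddings and one query embedding.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query_vec = [1.0, 0.0]

candidates = retrieve(query_vec, doc_vecs, k=2)  # [0, 2]

# Stand-in for a cross-attention scorer: a fixed lookup table.
toy_scores = {0: 0.2, 2: 0.8}
final = rerank(candidates, lambda i: toy_scores[i])  # [2, 0]
```

Only the small candidate set reaches the second phase, which is why the retrieval phase has to be both fast and high-recall: a relevant document dropped there cannot be recovered by re-ranking.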
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
