Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering
Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Lan Luo, Ke Zhan, Enrui Hu, Xinyu Zhang, Hao Jiang, Zhao Cao, Fan Yu, Xin Jiang, Qun Liu, Lei Chen

TL;DR
This paper introduces HyperLink-induced Pre-training (HLP), a novel method that leverages hyperlink structures in web documents to improve passage retrieval for open-domain question answering, especially in low-data scenarios.
Contribution
HLP uses hyperlink-based topology to generate relevance signals for pre-training dense retrievers, bridging the gap between upstream signals and downstream relevance in QA tasks.
Findings
HLP outperforms BM25 by up to 7 points in zero-shot retrieval accuracy.
HLP surpasses other pre-training methods by more than 10 points in top-20 retrieval accuracy.
HLP is effective across various open-domain QA scenarios, including zero-shot, few-shot, multi-hop, and out-of-domain.
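For readers unfamiliar with the metric behind these numbers, top-k retrieval accuracy is commonly computed as the fraction of questions for which at least one of the k highest-ranked passages contains the answer. The sketch below illustrates that common definition only; the function name, input layout, and toy data are assumptions made for illustration and are not taken from the paper.

```python
def top_k_accuracy(ranked_hits, k):
    """Fraction of questions with an answer-bearing passage in the top k.

    ranked_hits: one list of booleans per question, ordered by retrieval
    rank; True marks a retrieved passage that contains the answer.
    (Hypothetical input layout, chosen for illustration.)
    """
    if not ranked_hits:
        return 0.0
    answered = sum(1 for hits in ranked_hits if any(hits[:k]))
    return answered / len(ranked_hits)


# Toy data: four questions, three retrieved passages each.
hits = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]

for k in (1, 2, 3):
    print(k, top_k_accuracy(hits, k))
```

A gain of "10 points in top-20 accuracy" would then mean that this fraction, expressed as a percentage with k = 20, rises by 10.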
Abstract
To alleviate the data scarcity problem in training question answering systems, recent works propose additional intermediate pre-training for dense passage retrieval (DPR). However, there still remains a large discrepancy between the provided upstream signals and the downstream question-passage relevance, which leads to less improvement. To bridge this gap, we propose the HyperLink-induced Pre-training (HLP), a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios.…
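The abstract names two hyperlink structures, dual-link and co-mention, without defining them. The sketch below assumes that a dual-link pair is two passages that link to each other, and that a co-mention pair is two passages that both link to the same third document. These definitions, the function name, and the toy link graph are illustrative assumptions; this is not the authors' data pipeline.

```python
from itertools import combinations


def hyperlink_pairs(links):
    """Collect candidate passage pairs from a link graph.

    links: dict mapping a passage id to the set of ids it links to.
    For simplicity each passage is treated as its own document
    (an assumption of this sketch).
    """
    dual_link, co_mention = set(), set()
    for a, b in combinations(sorted(links), 2):
        # Assumed dual-link: a links to b and b links to a.
        if b in links[a] and a in links[b]:
            dual_link.add((a, b))
        # Assumed co-mention: both link to some third target.
        if (links[a] & links[b]) - {a, b}:
            co_mention.add((a, b))
    return dual_link, co_mention


# Toy link graph with hypothetical passage ids.
toy_links = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "D": {"C"},
}

dual, co = hyperlink_pairs(toy_links)
print(sorted(dual))
print(sorted(co))
```

Pairs found this way could serve as pseudo question-passage pairs for contrastive pre-training of a dense retriever, which is the role the abstract assigns to hyperlink-induced relevance.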
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Expert finding and Q&A systems
