DocReLM: Mastering Document Retrieval with Language Model

Gengchen Wei; Xinle Pang; Tianning Zhang; Yu Sun; Xun Qian; Chen Lin; Han-Sen Zhong; Wanli Ouyang

arXiv:2405.11461·cs.IR·May 21, 2024·1 cites

Open Access

TL;DR

DocReLM leverages large language models to significantly improve semantic understanding and retrieval accuracy in academic document search, outperforming existing systems like Google Scholar in quantum physics and computer vision.

Contribution

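The paper introduces a document retrieval system that uses large language models in two ways: to generate domain-specific data for training the retriever and reranker, and to identify further candidates from the references of retrieved papers. Together these improve semantic understanding in academic search.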

Findings

01. Top-10 accuracy of 44.12% in computer vision

02. Top-10 accuracy of 36.21% in quantum physics

03. Outperforms Google Scholar significantly in both fields
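The top-10 accuracy figures above can be read with a small sketch. This is only an illustration under an assumption: it treats top-k accuracy as the fraction of queries for which at least one annotated relevant paper appears among the first k results. The paper's exact definition is not given in this summary, and the function name and toy data are hypothetical.

```python
from typing import Dict, List, Set


def top_k_accuracy(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant document in the first k results.

    Assumed definition, not taken from the paper.
    """
    if not ranked:
        return 0.0
    hits = 0
    for query_id, doc_ids in ranked.items():
        gold = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-k results is annotated relevant.
        if any(doc_id in gold for doc_id in doc_ids[:k]):
            hits += 1
    return hits / len(ranked)


# Toy data (hypothetical): two queries, each with a ranked result list.
toy_ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
toy_relevant = {"q1": {"c"}, "q2": {"z"}}

accuracy_at_3 = top_k_accuracy(toy_ranked, toy_relevant, k=3)
accuracy_at_2 = top_k_accuracy(toy_ranked, toy_relevant, k=2)
```

With the toy data, one of the two queries has its relevant paper within the first three results, and neither does within the first two.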

Abstract

With over 200 million published academic documents and millions of new documents being written each year, academic researchers face the challenge of searching for information within this vast corpus. However, existing retrieval systems struggle to understand the semantics and domain knowledge present in academic papers. In this work, we demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities, significantly outperforming existing systems. Our approach involves training the retriever and reranker using domain-specific data generated by large language models. Additionally, we utilize large language models to identify candidates from the references of retrieved papers to further enhance the performance. We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to…
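The abstract describes three stages: a retriever, a reranker, and a step that pulls candidates from the references of retrieved papers. A minimal sketch of that control flow follows. Everything concrete in it is an assumption for illustration: the function names, the word-overlap scorer standing in for trained models, the reference selector standing in for a large language model, and the choice to place selected references directly after the paper that cites them.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: number of shared lowercase words (stand-in for a trained model)."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def retrieve_rerank_expand(
    query: str,
    corpus: Dict[str, str],
    references: Dict[str, List[str]],
    retrieve_score: Scorer,
    rerank_score: Scorer,
    pick_references: Callable[[str, List[str]], List[str]],
    n_retrieve: int = 50,
    n_final: int = 10,
) -> List[str]:
    # Stage 1: cheap scoring over the whole corpus, keep the best n_retrieve.
    candidates = sorted(
        corpus, key=lambda d: retrieve_score(query, corpus[d]), reverse=True
    )[:n_retrieve]
    # Stage 2: rescore only the shortlisted candidates.
    reranked = sorted(
        candidates, key=lambda d: rerank_score(query, corpus[d]), reverse=True
    )
    # Stage 3: add references of retrieved papers that the selector keeps.
    # Placing them right after the citing paper is an assumption of this sketch.
    result: List[str] = []
    for doc_id in reranked:
        if doc_id not in result:
            result.append(doc_id)
        known_refs = [r for r in references.get(doc_id, []) if r in corpus]
        for ref_id in pick_references(query, known_refs):
            if ref_id not in result:
                result.append(ref_id)
    return result[:n_final]


# Toy corpus and citation graph (hypothetical).
toy_corpus = {
    "p1": "quantum error correction codes overview",
    "p2": "image segmentation with transformers",
    "p3": "surface codes on a lattice",
}
toy_references = {"p1": ["p3"]}
toy_query = "quantum error correction codes"


def toy_selector(query: str, ref_ids: List[str]) -> List[str]:
    """Stand-in for the language-model step: keep references sharing any word with the query."""
    return [r for r in ref_ids if word_overlap(query, toy_corpus[r]) > 0]


toy_result = retrieve_rerank_expand(
    toy_query, toy_corpus, toy_references,
    word_overlap, word_overlap, toy_selector,
    n_retrieve=1,
)
```

In the toy run, the first stage keeps only `p1`, yet `p3` still reaches the final list because `p1` cites it. That is the effect the reference step is meant to have: recovering relevant papers the retriever missed.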

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training