DocReLM: Mastering Document Retrieval with Language Model
Gengchen Wei, Xinle Pang, Tianning Zhang, Yu Sun, Xun Qian, Chen Lin, Han-Sen Zhong, Wanli Ouyang

TL;DR
DocReLM uses large language models to improve semantic understanding and retrieval accuracy in academic document search, outperforming existing systems such as Google Scholar on quantum physics and computer vision queries.
Contribution
The paper introduces a novel document retrieval system that uses large language models for training and candidate identification, enhancing semantic understanding in academic search.
Findings
Top-10 accuracy of 44.12% in computer vision
Top-10 accuracy of 36.21% in quantum physics
Outperforms Google Scholar significantly in both fields
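The top-10 accuracy figures above can be read as the share of test queries for which an annotated relevant paper appears among the first ten results. The following is a minimal sketch of that metric, assuming one annotated target paper per query; the paper's exact evaluation protocol may differ, and the function name and toy data are hypothetical.

```python
def top_k_accuracy(ranked_results, relevant, k=10):
    """Fraction of queries whose annotated paper appears in the top k results.

    ranked_results: dict mapping query id -> list of doc ids, best first.
    relevant: dict mapping query id -> the single annotated relevant doc id
              (an assumption; real test sets may list several per query).
    """
    if not ranked_results:
        return 0.0
    hits = sum(
        1 for query_id, docs in ranked_results.items()
        if relevant[query_id] in docs[:k]
    )
    return hits / len(ranked_results)


# Toy data, invented for illustration only.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9"], "q3": ["d5", "d4"]}
gold = {"q1": "d1", "q2": "d8", "q3": "d5"}
print(top_k_accuracy(ranked, gold, k=2))
```

With k=2, two of the three toy queries count as hits, because the annotated paper for q2 never appears in its result list.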
Abstract
With over 200 million published academic documents and millions of new documents being written each year, academic researchers face the challenge of searching for information within this vast corpus. However, existing retrieval systems struggle to understand the semantics and domain knowledge present in academic papers. In this work, we demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities, significantly outperforming existing systems. Our approach involves training the retriever and reranker using domain-specific data generated by large language models. Additionally, we utilize large language models to identify candidates from the references of retrieved papers to further enhance the performance. We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to…
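The three components named in the abstract (retriever, reranker, and candidates drawn from the references of retrieved papers) can be sketched as a small pipeline. This is an illustrative toy under stated assumptions, not the authors' implementation: a bag-of-words cosine similarity stands in for both trained models, every cited paper is added where DocReLM uses a large language model to pick candidates, and the stage ordering, function names, and corpus are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a dense retriever's encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k):
    """Stage 1: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(
        corpus, key=lambda d: cosine(q, embed(corpus[d]["text"])), reverse=True
    )
    return ranked[:k]


def expand_with_references(doc_ids, corpus):
    """Stage 2: add papers cited by the retrieved papers.

    Assumption: every cited paper is added; the abstract says an LLM
    identifies the candidates among the references.
    """
    expanded = list(doc_ids)
    for doc_id in doc_ids:
        for ref in corpus[doc_id]["refs"]:
            if ref in corpus and ref not in expanded:
                expanded.append(ref)
    return expanded


def rerank(query, doc_ids, corpus):
    """Stage 3: reorder candidates; a trained reranker would score them here."""
    q = embed(query)
    return sorted(
        doc_ids, key=lambda d: cosine(q, embed(corpus[d]["text"])), reverse=True
    )


def search(query, corpus, k=2):
    candidates = retrieve(query, corpus, k)
    candidates = expand_with_references(candidates, corpus)
    return rerank(query, candidates, corpus)


# Hypothetical three-paper corpus; "a" cites "c".
corpus = {
    "a": {"text": "quantum error correction surface code", "refs": ["c"]},
    "b": {"text": "image segmentation with transformers", "refs": []},
    "c": {"text": "stabilizer codes for quantum error correction", "refs": []},
}
results = search("quantum error correction", corpus, k=1)
print(results)
```

In this toy run the retriever returns only paper "a" because k=1, and paper "c" enters the candidate list solely through the reference list of "a", which is the effect the reference-expansion step is meant to have.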
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
