A Neural Corpus Indexer for Document Retrieval
Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Allen Sun, Weiwei Deng, Qi Zhang, Mao Yang

TL;DR
This paper introduces Neural Corpus Indexer (NCI), an end-to-end neural network that directly generates document identifiers from queries, significantly improving recall in document retrieval tasks.
Contribution
The paper presents a novel sequence-to-sequence neural model with a prefix-aware decoder for end-to-end document retrieval, unifying training and indexing stages.
Findings
Achieved a +21.4% relative improvement in Recall@1 on the NQ320k dataset
Achieved a +16.8% relative improvement in R-Precision on the TriviaQA dataset
Demonstrated superior performance over baseline methods
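The two metrics named in the findings, and the notion of a relative improvement, can be illustrated with a minimal sketch. It uses the standard textbook definitions of Recall@k and R-Precision. The function names and document identifiers are hypothetical and not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = len(set(ranked[:r]) & relevant)
    return hits / r


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical ranked identifiers returned for one query (illustrative only).
ranked_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}

print(recall_at_k(ranked_ids, relevant_ids, k=1))
print(recall_at_k(ranked_ids, relevant_ids, k=2))
print(r_precision(ranked_ids, relevant_ids))
print(relative_improvement(0.75, 0.5))
```

A relative improvement is measured against the baseline score, so it is not the same as a gain in absolute percentage points: moving from 0.50 to 0.75 is a 50% relative improvement but only 25 points in absolute terms.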
Abstract
Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, where the index is hard to be directly optimized for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture, and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrated the superiority of NCI on two commonly used academic benchmarks, achieving +21.4% and +16.8% relative enhancement…
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques
