A Neural Corpus Indexer for Document Retrieval

Yujing Wang; Yingyan Hou; Haonan Wang; Ziming Miao; Shibin Wu; Hao Sun; Qi Chen; Yuqing Xia; Chengmin Chi; Guoshuai Zhao; Zheng Liu; Xing Xie; Hao Allen Sun; Weiwei Deng; Qi Zhang; Mao Yang

arXiv:2206.02743 · cs.IR · February 14, 2023 · 49 cites

TL;DR

This paper introduces Neural Corpus Indexer (NCI), an end-to-end neural network that directly generates document identifiers from queries, significantly improving recall in document retrieval tasks.

Contribution

The paper presents a novel sequence-to-sequence neural model with a prefix-aware decoder for end-to-end document retrieval, unifying training and indexing stages.

Findings

01 Achieved a +21.4% relative improvement in Recall@1 on the NQ320k dataset

02 Achieved a +16.8% relative improvement in R-Precision on the TriviaQA dataset

03 Outperformed the baseline methods on both benchmarks
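
For readers unfamiliar with the two metrics named in the findings, the sketch below shows one common way to compute Recall@k and R-Precision for a single query, plus the arithmetic behind a relative improvement. The function names and the sample document IDs and scores are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return len(set(ranked[:r]) & relevant) / r


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, expressed as a fraction."""
    return (new - baseline) / baseline


# Made-up ranking for one query (illustrative only).
ranking = ["d3", "d1", "d7"]
print(recall_at_k(ranking, {"d1"}, k=1))
print(recall_at_k(ranking, {"d1"}, k=2))
print(r_precision(ranking, {"d1", "d7"}))
# Made-up scores: 0.5 for a baseline and 0.6 for a new model.
print(relative_gain(0.6, 0.5))
```

A relative improvement is measured against the baseline score itself, so it is not the same as a gain in absolute percentage points.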

Abstract

Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, where the index is hard to optimize directly for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture, and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrate the superiority of NCI on two commonly used academic benchmarks, achieving +21.4% and +16.8% relative enhancement…
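
The abstract describes a model that generates document identifiers token by token. One common way to keep such a generator from emitting identifiers that do not exist in the corpus is to restrict decoding to a prefix tree (trie) of valid identifiers. The sketch below illustrates that idea with a toy beam search. It is a minimal sketch under stated assumptions, not the authors' implementation: `build_trie`, `constrained_beam_search`, the `END` marker, and `toy_score` are all hypothetical, and `toy_score` merely stands in for the log-probabilities a trained decoder would produce for a given query.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Token = int
END: Token = -1  # hypothetical marker for "identifier ends here"


def build_trie(doc_ids: Sequence[Sequence[Token]]) -> Dict:
    """Store every valid identifier in a nested-dict prefix tree."""
    root: Dict = {}
    for doc_id in doc_ids:
        node = root
        for tok in list(doc_id) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_beam_search(
    score_fn: Callable[[Tuple[Token, ...], Token], float],
    trie: Dict,
    beam_size: int,
    max_len: int,
) -> List[Tuple[float, Tuple[Token, ...]]]:
    """Beam search that only follows edges present in the trie."""
    beams = [(0.0, (), trie)]  # (score so far, prefix, trie node)
    finished: List[Tuple[float, Tuple[Token, ...]]] = []
    for _ in range(max_len + 1):
        candidates = []
        for score, prefix, node in beams:
            for tok, child in node.items():
                if tok == END:
                    # The prefix is a complete, valid identifier.
                    finished.append((score, prefix))
                else:
                    new_score = score + score_fn(prefix, tok)
                    candidates.append((new_score, prefix + (tok,), child))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda f: f[0], reverse=True)
    return finished[:beam_size]


# Toy corpus of four identifiers, each a sequence of three tokens.
VALID_IDS = [(1, 2, 3), (1, 2, 4), (1, 5, 0), (2, 0, 0)]
TARGET = (1, 2, 4)  # the identifier the toy scorer prefers


def toy_score(prefix: Tuple[Token, ...], tok: Token) -> float:
    """Stand-in for a decoder: penalize distance from TARGET at each step."""
    return -float(abs(tok - TARGET[len(prefix)]))


results = constrained_beam_search(toy_score, build_trie(VALID_IDS), 2, 3)
for score, doc_id in results:
    print(doc_id, score)
```

Because every expansion follows a trie edge, each returned sequence is guaranteed to be an identifier of a real document, whatever scores the decoder assigns.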

Code & Models

Repositories

solidsea98/neural-corpus-indexer-nci
jax

Videos

A Neural Corpus Indexer for Document Retrieval · SlidesLive

Taxonomy

Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques