Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation
Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon, and Daxin Jiang

TL;DR
This paper introduces DSI-QG, a new framework that improves differentiable search indexes by generating relevant queries during indexing, thereby reducing data mismatch issues and enhancing retrieval performance across languages.
Contribution
The paper proposes DSI-QG, a novel indexing method that uses query generation and re-ranking to align training and retrieval data distributions in differentiable search indexes.
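As a rough illustration of the idea, the sketch below builds indexing examples from generated queries instead of raw document text, so that indexing inputs resemble the short queries seen at retrieval time. It is a minimal sketch under stated assumptions, not the authors' implementation: `build_indexing_pairs`, `toy_generate`, and `toy_score` are hypothetical names, the generator and scorer are toy stand-ins for a trained query-generation model and a ranking model, and keeping only the top-scoring candidates is one assumed way to apply the re-ranking step.

```python
from typing import Callable, Dict, List, Tuple


def build_indexing_pairs(
    docs: Dict[str, str],
    generate_queries: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n_candidates: int = 4,
    keep_top: int = 2,
) -> List[Tuple[str, str]]:
    """Return (query, docid) pairs to use as indexing data.

    Each document is represented by generated queries rather than its
    full text. Candidates are ranked by `score` and only the best
    `keep_top` are kept (an assumed filtering rule).
    """
    pairs: List[Tuple[str, str]] = []
    for docid, text in docs.items():
        candidates = generate_queries(text, n_candidates)
        # Highest-scoring candidates first; sorted() is stable on ties.
        ranked = sorted(candidates, key=lambda q: score(q, text), reverse=True)
        for query in ranked[:keep_top]:
            pairs.append((query, docid))
    return pairs


def toy_generate(text: str, n: int) -> List[str]:
    """Hypothetical stand-in for a query generator: 3-word windows."""
    words = text.split()
    return [" ".join(words[i:i + 3]) for i in range(n)]


def toy_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a ranker: count of shared words."""
    return float(len(set(query.split()) & set(text.split())))


if __name__ == "__main__":
    docs = {
        "doc-1": "the differentiable search index maps queries to document identifiers",
        "doc-2": "query generation reduces the mismatch between indexing and retrieval",
    }
    for query, docid in build_indexing_pairs(docs, toy_generate, toy_score):
        print(docid, "<-", query)
```

In a real setup the resulting pairs would serve as training examples for a sequence-to-sequence model that maps a query to a document identifier; that training step is omitted here.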
Findings
DSI-QG significantly outperforms original DSI models on multiple datasets.
Query generation during indexing improves cross-lingual retrieval accuracy.
Mitigating data distribution mismatch enhances overall retrieval effectiveness.
Abstract
The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures where index and retrieval are two different and separate components, DSI uses a single transformer model to perform both indexing and retrieval. In this paper, we identify and tackle an important issue of current DSI models: the data distribution mismatch that occurs between the DSI indexing and retrieval processes. Specifically, we argue that, at indexing, current DSI methods learn to build connections between the text of long documents and the identifier of the documents, but then retrieval of document identifiers is based on queries that are commonly much shorter than the indexed documents. This problem is further exacerbated when using DSI for cross-lingual retrieval, where document text and query text are in different languages. To address this…
Taxonomy
Topics: Data Management and Algorithms · Information Retrieval and Search Behavior · Advanced Image and Video Retrieval Techniques
