Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval

Yubao Tang; Ruqing Zhang; Jiafeng Guo; Maarten de Rijke; Yixing Fan; Xueqi Cheng

arXiv:2407.11504·cs.IR·July 17, 2024

TL;DR

This paper introduces BootRet, a dynamic pre-training method for generative retrieval that updates document identifiers during training, leading to improved retrieval performance especially in zero-shot scenarios.

Contribution

The paper proposes BootRet, a novel bootstrapped pre-training approach that dynamically adjusts document identifiers, enhancing generative retrieval models beyond static identifier methods.

Findings

01. BootRet outperforms existing pre-training methods in retrieval tasks.

02. BootRet achieves strong zero-shot retrieval performance.

03. Dynamic identifier updating improves model memorization and relevance prediction.

Abstract

Generative retrieval uses differentiable search indexes to directly generate relevant document identifiers in response to a query. Recent studies have highlighted the potential of a strong generative retrieval model, trained with carefully crafted pre-training tasks, to enhance downstream retrieval tasks via fine-tuning. However, the full power of pre-training for generative retrieval remains underexploited due to its reliance on pre-defined static document identifiers, which may not align with evolving model parameters. In this work, we introduce BootRet, a bootstrapped pre-training method for generative retrieval that dynamically adjusts document identifiers during pre-training to accommodate the continuing memorization of the corpus. BootRet involves three key training phases: (i) initial identifier generation, (ii) pre-training via corpus indexing and relevance prediction tasks, and…
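
As a rough illustration of the bootstrapping idea in the abstract, the toy Python sketch below alternates between a training round and regenerating identifiers from the updated parameters. It is not BootRet itself: the term-weight "model" and the functions `make_identifier`, `pretrain_step`, and `bootstrap` are hypothetical stand-ins invented for this example, and the phase truncated in the abstract is not reproduced.

```python
from collections import Counter


def make_identifier(doc, weights, k=2):
    """Toy identifier: the k highest-weighted distinct terms of the document.

    Ties are broken alphabetically so the result is deterministic.
    """
    terms = sorted(set(doc.split()), key=lambda t: (-weights.get(t, 0.0), t))
    return " ".join(terms[:k])


def pretrain_step(corpus, weights):
    """Toy stand-in for one pre-training round (an assumption, not the paper's
    objective): add 1 / document-frequency to each term's weight, so that
    rarer terms become more prominent over time."""
    df = Counter(t for doc in corpus for t in set(doc.split()))
    return {t: weights.get(t, 0.0) + 1.0 / df[t] for t in df}


def bootstrap(corpus, rounds=2, k=2):
    """Alternate between 'training' and regenerating identifiers.

    Returns the list of identifier assignments after each stage,
    starting with the initial (static) assignment.
    """
    weights = {}
    ids = [make_identifier(d, weights, k) for d in corpus]  # initial identifiers
    history = [ids]
    for _ in range(rounds):
        weights = pretrain_step(corpus, weights)            # training stand-in
        ids = [make_identifier(d, weights, k) for d in corpus]  # refresh identifiers
        history.append(ids)
    return history


corpus = ["neural retrieval model", "neural ranking model"]
history = bootstrap(corpus, rounds=1)
print(history[0])  # ['model neural', 'model neural']
print(history[1])  # ['retrieval model', 'ranking model']
```

In this toy run the initial identifiers collide because they were fixed before any training, while the identifiers regenerated after one round separate the two documents. That is the intuition behind letting identifiers follow the evolving parameters rather than keeping them static.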

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: ALIGN