InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval

Hugo Abonizio; Luiz Bonifacio; Vitor Jeronymo; Roberto Lotufo; Jakub Zavrel; Rodrigo Nogueira

arXiv:2307.04601·cs.IR·July 11, 2023·1 cites

TL;DR

This paper introduces InPars Toolkit, a comprehensive, reproducible pipeline for synthetic data generation in neural information retrieval, enabling accessible research and benchmarking across multiple datasets and models.

Contribution

The paper provides a unified, open-source toolkit that reproduces and extends previous methods, supports various LLMs, filtering, and reranking, and openly shares the resulting data and models.
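The filtering step mentioned above can be illustrated with a minimal sketch. It assumes each synthetic query-document pair carries a numeric quality score and keeps only the highest-scoring pairs. The `SyntheticPair` class, the `filter_top_k` function, and the example scores are hypothetical and are not taken from the toolkit's actual API.

```python
from dataclasses import dataclass


@dataclass
class SyntheticPair:
    """A generated query paired with the document it was generated from."""
    query: str
    doc_id: str
    score: float  # hypothetical quality score (assumption, not the toolkit's field)


def filter_top_k(pairs, k):
    """Return the k highest-scoring pairs, best first."""
    return sorted(pairs, key=lambda p: p.score, reverse=True)[:k]


# Made-up example data for illustration only.
pairs = [
    SyntheticPair("what is bm25", "d1", 0.91),
    SyntheticPair("asdf", "d2", 0.12),
    SyntheticPair("how do rerankers work", "d3", 0.78),
]
kept = filter_top_k(pairs, k=2)
kept_ids = [p.doc_id for p in kept]
```

Under these assumptions, the retained pairs would then serve as training data for a reranker.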

Findings

01 Successfully reproduced the InPars method on multiple datasets.

02 Generated synthetic data for 18 datasets, which took more than 2,000 GPU hours.

03 Provided open access to the data and models for community use.

Abstract

Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LLMs as synthetic data generators for IR tasks. This makes them an attractive solution for IR tasks that suffer from a lack of annotated data. However, the reproducibility of these methods was limited, because InPars' training scripts are based on TPUs -- which are not widely accessible -- and because the code for Promptagator was not released and its proprietary LLM is not publicly accessible. To fully realize the potential of these methods and make their impact more widespread in the research…
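As a rough illustration of "providing instructions and a few examples on a prompt", the sketch below assembles a few-shot prompt that asks an LLM to write a query for a new document. The template wording, the `build_few_shot_prompt` function, and the example texts are assumptions made for illustration; they are not the prompts used by InPars or Promptagator.

```python
def build_few_shot_prompt(instruction, examples, document):
    """Assemble an instruction, worked (document, query) examples, and a new document.

    The prompt ends with an open "Relevant Query:" slot for the LLM to complete.
    """
    parts = [instruction.strip(), ""]
    for i, (doc, query) in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f"Document: {doc}")
        parts.append(f"Relevant Query: {query}")
        parts.append("")
    parts.append(f"Example {len(examples) + 1}:")
    parts.append(f"Document: {document}")
    parts.append("Relevant Query:")
    return "\n".join(parts)


# Made-up few-shot examples (hypothetical, for illustration only).
examples = [
    ("BM25 is a ranking function used by search engines.", "what is bm25"),
    ("Rerankers rescore an initial list of retrieved documents.", "how do rerankers work"),
]
prompt = build_few_shot_prompt(
    "Write a query that the document answers.",
    examples,
    "Synthetic queries can be used to train retrieval models.",
)
```

The completion returned by the LLM for the final slot would be taken as the synthetic query for the new document.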

Code & Models

Repositories

zetaalphavector/inpars (Official)

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Machine Learning and Data Classification