InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval
Hugo Abonizio, Luiz Bonifacio, Vitor Jeronymo, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

TL;DR
This paper introduces InPars Toolkit, a comprehensive, reproducible pipeline for synthetic data generation in neural information retrieval, enabling accessible research and benchmarking across multiple datasets and models.
Contribution
It provides a unified, open-source toolkit that reproduces and extends previous methods, supporting various LLMs, filtering, and reranking, with extensive data and model sharing.
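The filtering step mentioned above can be pictured as keeping only the highest-scoring synthetic query-document pairs before training. The following is a minimal sketch, not the toolkit's own code: the `filter_top_k` helper, the dictionary layout, and the `score` field are all hypothetical, and the score is assumed to come from some upstream signal such as a generation log-probability or a reranker relevance score.

```python
import heapq


def filter_top_k(pairs, k):
    """Keep the k synthetic pairs with the highest score.

    `pairs` is a list of dicts with a numeric "score" key (hypothetical
    layout chosen for this sketch). The result is sorted best-first.
    """
    return heapq.nlargest(k, pairs, key=lambda pair: pair["score"])


if __name__ == "__main__":
    synthetic = [
        {"query": "what is dense retrieval", "doc_id": "d1", "score": -0.5},
        {"query": "bm25 ranking function", "doc_id": "d2", "score": -0.1},
        {"query": "asdf", "doc_id": "d3", "score": -2.0},
    ]
    for pair in filter_top_k(synthetic, 2):
        print(pair["doc_id"], pair["query"], pair["score"])
```

Using `heapq.nlargest` avoids sorting the full list when only a small fraction of the generated pairs is kept.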
Findings
Successfully reproduced the InPars method on multiple datasets.
Generated synthetic data for 18 datasets, which took over 2,000 GPU hours.
Provided open access to data and models for community use.
Abstract
Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LLMs as synthetic data generators for IR tasks. This makes them an attractive solution for IR tasks that suffer from a lack of annotated data. However, the reproducibility of these methods was limited, because InPars' training scripts are based on TPUs -- which are not widely accessible -- and because the code for Promptagator was not released and its proprietary LLM is not publicly accessible. To fully realize the potential of these methods and make their impact more widespread in the research…
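The idea of "providing instructions and a few examples on a prompt" can be illustrated with a small prompt builder. This is a sketch under stated assumptions rather than the prompt template used by InPars or Promptagator: the `Example` class, the `build_prompt` function, and the "Document:" / "Relevant Query:" labels are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One hand-written (document, query) demonstration."""
    document: str
    query: str


def build_prompt(examples, target_document, instruction=""):
    """Assemble a few-shot prompt that ends where the LLM should write a query.

    The field labels below are assumptions for this sketch, not a
    template taken from the paper.
    """
    parts = []
    if instruction:
        parts.append(instruction.strip())
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\nDocument: {ex.document}\nRelevant Query: {ex.query}"
        )
    # The final block is left open so the model completes the query.
    parts.append(
        f"Example {len(examples) + 1}:\n"
        f"Document: {target_document}\nRelevant Query:"
    )
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [
        Example("BM25 is a bag-of-words ranking function.", "what is bm25"),
        Example("TPUs are accelerators built for tensor workloads.", "what are tpus used for"),
    ]
    print(build_prompt(demos, "Rerankers rescore a candidate list of documents.",
                       instruction="Write a query that the document answers."))
```

The text an LLM generates after the final "Relevant Query:" line would then be paired with the target document as one synthetic training example.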
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Machine Learning and Data Classification
