InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval
Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

TL;DR
InPars-v2 uses open-source large language models to generate synthetic query-document pairs and existing rerankers to select the best of them for training, yielding state-of-the-art retrieval results on the BEIR benchmark.
Contribution
This work introduces InPars-v2, a dataset generator built on open-source LLMs and existing rerankers rather than proprietary models, and releases its code, synthetic data, and finetuned models.
Findings
State-of-the-art results on BEIR benchmark
Open-source code and data available for research
Effective use of open-source LLMs for dataset generation
Abstract
Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu
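The generate-then-select idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt template, the function names, and the word-overlap scorer are all assumptions made for the example. In the paper's setting, an open-source LLM would complete the prompt and a reranker would supply the score.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (synthetic query, source document)


def build_few_shot_prompt(examples: List[Pair], document: str) -> str:
    """Build a few-shot prompt: demonstrations first, then the new document
    with the query left open for the LLM to complete.
    The template wording is hypothetical, not taken from the paper."""
    parts = [f"Document: {doc}\nRelevant query: {query}\n" for query, doc in examples]
    parts.append(f"Document: {document}\nRelevant query:")
    return "\n".join(parts)


def select_top_pairs(
    pairs: List[Pair], score: Callable[[str, str], float], keep: int
) -> List[Pair]:
    """Keep the `keep` highest-scoring synthetic pairs for training."""
    ranked = sorted(pairs, key=lambda p: score(p[0], p[1]), reverse=True)
    return ranked[:keep]


def toy_overlap_score(query: str, document: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in the
    document. A real pipeline would call a reranker here instead."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)


if __name__ == "__main__":
    doc = "BM25 is a ranking function"
    candidates = [("what is bm25", doc), ("weather today", doc), ("ranking function", doc)]
    print(build_few_shot_prompt([("what is bm25", doc)], "monoT5 is a reranker"))
    print(select_top_pairs(candidates, toy_overlap_score, keep=2))
```

Passing the scorer as a callable keeps the selection step independent of the model behind it, so the toy scorer can be swapped for a real reranker without touching the filtering logic.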
Code & Models
- 🤗 inpars/monot5-3b-inpars-v2-arguana-promptagator · model · 1 dl
- 🤗 inpars/monot5-3b-inpars-v2-fiqa-promptagator · model · 2 dl
- 🤗 inpars/monot5-3b-inpars-v2-fever-promptagator · model · 3 dl · ♡ 1
- 🤗 inpars/monot5-3b-inpars-v2-nfcorpus-promptagator · model · 1 dl
- 🤗 inpars/monot5-3b-inpars-v2-scifact-promptagator · model · 2 dl · ♡ 2
- 🤗 inpars/monot5-3b-inpars-v2-hotpotqa-promptagator · model · 1 dl
- 🤗 inpars/monot5-3b-inpars-v2-trec-covid-promptagator · model · 1 dl
- 🤗 inpars/monot5-3b-inpars-v2-quora-promptagator · model · 1 dl · ♡ 1
- 🤗 inpars/monot5-3b-inpars-v2-nq-promptagator · model · 3 dl · ♡ 3
- 🤗 inpars/monot5-3b-inpars-v2-webis-touche2020-promptagator · model · 1 dl
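For readers who want to see how a monoT5-style reranker turns a query-document pair into a score, here is a small sketch. It assumes the checkpoints above follow the original monoT5 convention: the input is formatted as `Query: ... Document: ... Relevant:` and the relevance score is the softmax probability of the "true" token against the "false" token at the first decoding step. The function names are illustrative, and the logits would come from the model, which this sketch does not load.

```python
import math
from typing import Callable, List


def monot5_input(query: str, document: str) -> str:
    """Format a pair the way the original monoT5 reranker expects
    (assumed to apply to the checkpoints listed above)."""
    return f"Query: {query} Document: {document} Relevant:"


def relevance_probability(true_logit: float, false_logit: float) -> float:
    """Softmax over the 'true' and 'false' token logits; returns P('true').
    Subtracting the max keeps the exponentials numerically stable."""
    m = max(true_logit, false_logit)
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


def rerank(
    query: str, candidates: List[str], score: Callable[[str, str], float]
) -> List[str]:
    """Reorder first-stage candidates (e.g. from BM25) by reranker score."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)


if __name__ == "__main__":
    print(monot5_input("what is bm25", "BM25 is a ranking function"))
    print(relevance_probability(2.0, -2.0))
```

In a full pipeline, `score` would wrap a forward pass of one of the finetuned models and feed its two token logits to `relevance_probability`.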
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Linear Layer · Layer Normalization · Softmax · Adam
