InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

Vitor Jeronymo; Luiz Bonifacio; Hugo Abonizio; Marzieh Fadaee; Roberto Lotufo; Jakub Zavrel; Rodrigo Nogueira

arXiv:2301.01820·cs.IR·May 30, 2023·26 cites

TL;DR

InPars-v2 uses open-source large language models to generate synthetic query-document pairs and an existing reranker to select the best pairs for training. A BM25 retrieval pipeline followed by a monoT5 reranker finetuned on this data achieves new state-of-the-art results on the BEIR benchmark.

Contribution

This work introduces InPars-v2, a dataset generator that relies on open-source LLMs and existing rerankers rather than proprietary models to produce and select synthetic training data for retrieval models.

Findings

01 State-of-the-art results on the BEIR benchmark

02 Open-source code and data available for research

03 Effective use of open-source LLMs for dataset generation

Abstract

Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu
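The two steps described in the abstract, few-shot query generation and reranker-based selection of synthetic pairs, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the prompt template, the function names, the `keep` cutoff, and especially `token_overlap_score`, which is a toy stand-in for a real reranker such as monoT5. It is not the authors' implementation; the released repository is the reference for that.

```python
from typing import Callable, List, Tuple

# A synthetic training pair: (generated query, source document).
Pair = Tuple[str, str]


def build_few_shot_prompt(examples: List[Pair], document: str) -> str:
    """Build a few-shot prompt asking an LLM for a query relevant to `document`.

    The template is hypothetical; the source does not specify the exact wording.
    """
    parts = [
        f"Document: {doc}\nRelevant query: {query}\n" for query, doc in examples
    ]
    # The final block is left open so the LLM completes the query.
    parts.append(f"Document: {document}\nRelevant query:")
    return "\n".join(parts)


def select_top_pairs(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    keep: int,
) -> List[Pair]:
    """Keep the `keep` pairs that the scoring function rates as most relevant."""
    ranked = sorted(pairs, key=lambda p: score_fn(p[0], p[1]), reverse=True)
    return ranked[:keep]


def token_overlap_score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query tokens found in the document.

    Stand-in only; in practice a neural reranker would supply this score.
    """
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


if __name__ == "__main__":
    doc = "BM25 is a ranking function used by search engines"
    candidates = [
        ("what is bm25", doc),
        ("weather forecast today", doc),
    ]
    print(build_few_shot_prompt([("example query", "example document")], doc))
    print(select_top_pairs(candidates, token_overlap_score, keep=1))
```

Passing the scorer as a function keeps the selection step independent of any particular reranker, so the toy overlap score can be replaced by a model-based score without changing the filtering logic.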

Code & Models

Repositories

zetaalphavector/inpars (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Linear Layer · Layer Normalization · Softmax · Adam