InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems

Matey Krastev; Miklos Hamar; Danilo Toapanta; Jesse Brouwers; Yibin Lei

arXiv:2508.13930·cs.IR·August 20, 2025

TL;DR

This paper reproduces InPars-style synthetic query generation pipelines for neural information retrieval and extends them with Contrastive Preference Optimization (CPO) fine-tuning of the query generator and dynamic, DSPy-optimized prompts. Both extensions improve retrieval performance while reducing the need for aggressive filtering.

Contribution

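It extends the InPars Toolkit with two techniques: fine-tuning the query generator LLM via CPO, and replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts built with the DSPy framework. Both target higher-quality synthetic training data for IR systems.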

Findings

01

Both extensions reduce the need for aggressive filtering of generated queries while improving retrieval performance.

02

The original InPars, InPars-V2, and Promptagator pipelines are reproduced on the SciFact benchmark using open-source reranker and generator models.

03

Code and datasets publicly released for further research.

Abstract

This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR) by leveraging the InPars Toolkit, a reproducible, end-to-end framework for generating training data using large language models (LLMs). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark and validate their effectiveness using open-source reranker and generator models. Building on this foundation, we introduce two key extensions to the pipeline: (1) fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts using the DSPy framework. Our results show that both extensions reduce the need for aggressive filtering while improving retrieval…
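To make the first extension more concrete, the sketch below computes a CPO-style loss for a single preference pair of generated queries. It assumes the commonly published CPO formulation (a sigmoid preference term on the log-probability gap plus a negative log-likelihood term on the preferred output); the excerpt above does not state the paper's exact objective or hyperparameters, so the function name `cpo_loss` and the parameters `beta` and `nll_weight` are illustrative assumptions, not the authors' implementation.

```python
import math


def cpo_loss(logp_chosen, logp_rejected, beta=0.1, nll_weight=1.0):
    """CPO-style loss for one (preferred, dispreferred) query pair.

    logp_chosen / logp_rejected: summed token log-probabilities that the
    generator assigns to the preferred and dispreferred query given the
    same document prompt. All names and defaults here are hypothetical.
    """
    # Scaled log-probability gap between preferred and dispreferred query.
    margin = beta * (logp_chosen - logp_rejected)

    # Preference term: -log(sigmoid(margin)), written in a numerically
    # stable form so large |margin| does not overflow exp().
    if margin >= 0:
        prefer = math.log1p(math.exp(-margin))
    else:
        prefer = -margin + math.log1p(math.exp(margin))

    # NLL term keeps the generator close to the preferred query itself.
    nll = -logp_chosen

    return prefer + nll_weight * nll


if __name__ == "__main__":
    # A pair where the preferred query is already more likely than the
    # dispreferred one, versus a pair where both are equally likely.
    print(cpo_loss(-1.0, -5.0))
    print(cpo_loss(-1.0, -1.0))
```

Under this formulation, widening the gap in favor of the preferred query lowers the preference term, while the NLL term only depends on the preferred query's likelihood. In practice the log-probabilities would come from the generator LLM in batches rather than as Python floats.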
