InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems
Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei

TL;DR
This paper reproduces existing synthetic query generation pipelines for neural information retrieval and extends them with Contrastive Preference Optimization fine-tuning and dynamic prompts, which improves retrieval performance.
Contribution
It extends the InPars framework with two techniques, CPO fine-tuning of the query generator and DSPy-based Chain-of-Thought prompt optimization, to raise the quality of synthetic training data for IR systems.
Findings
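Both extensions (CPO fine-tuning and DSPy-optimized CoT prompts) reduce the need for aggressive filtering of generated queries and improve retrieval accuracy.
The InPars, InPars-V2, and Promptagator pipelines are reproduced on the SciFact benchmark using open-source reranker and generator models.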
Code and datasets publicly released for further research.
Abstract
This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR) by leveraging the InPars Toolkit, a reproducible, end-to-end framework for generating training data using large language models (LLMs). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark and validate their effectiveness using open-source reranker and generator models. Building on this foundation, we introduce two key extensions to the pipeline: (1) fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts using the DSPy framework. Our results show that both extensions reduce the need for aggressive filtering while improving retrieval…
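The abstract names Contrastive Preference Optimization but does not spell out the objective. As a rough illustration only, the sketch below follows CPO as it is commonly formulated: a sigmoid preference term over the log-probabilities of a preferred and a rejected output, plus a negative log-likelihood term on the preferred one. The function name `cpo_loss`, the default `beta`, and the use of plain floats in place of model outputs are assumptions for illustration, not details taken from the paper.

```python
import math


def cpo_loss(logp_chosen: float, logp_rejected: float, beta: float = 0.1) -> float:
    """Per-example CPO-style loss (illustrative sketch, not the authors' code).

    logp_chosen / logp_rejected: summed token log-probabilities that the
    generator assigns to the preferred and rejected query for one document.
    beta: preference temperature (assumed default).
    """
    margin = beta * (logp_chosen - logp_rejected)
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    prefer = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    # Negative log-likelihood term keeps the model close to the preferred query.
    nll = -logp_chosen
    return prefer + nll


# A pair where the preferred query is already more likely gives a lower loss
# than the same pair with the roles swapped.
good = cpo_loss(logp_chosen=-1.0, logp_rejected=-5.0)
bad = cpo_loss(logp_chosen=-5.0, logp_rejected=-1.0)
```

In an actual training run the two log-probabilities would come from the query generator LLM, and the loss would be averaged over a batch of document and query-pair examples.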
