TL;DR
This study systematically evaluates the reproducibility of LLM-based query reformulation methods under a unified experimental framework. It reveals stability issues in reported gains, shows that those gains depend strongly on the retrieval paradigm, and provides an open toolkit for ongoing comparison.
Contribution
The paper provides a unified experimental framework for reproducibility studies, compares ten representative LLM-based reformulation methods, and releases an open-source toolkit with a public leaderboard.
Findings
Reformulation gains depend heavily on the retrieval paradigm.
Improvements in lexical retrieval do not always transfer to neural retrievers.
Larger LLMs do not always improve downstream performance.
Abstract
Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are typically obtained under heterogeneous experimental conditions, making it difficult to assess which findings are reproducible and which depend on specific implementation choices. In this work, we present a systematic reproducibility and comparative study of ten representative LLM-based query reformulation methods under a unified and strictly controlled experimental framework. We evaluate methods across two architectural LLM families at two parameter scales, three retrieval paradigms (lexical, learned sparse, and dense), and nine benchmark datasets spanning TREC Deep Learning and BEIR. Our results show that reformulation gains are strongly conditioned on the retrieval paradigm, that…
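To give a sense of the size of the evaluation described in the abstract, the following is a minimal sketch that enumerates the experimental grid. It assumes a full factorial design (every method run with every LLM, retriever, and dataset), which the abstract does not confirm. All labels such as method_1, family_A, and dataset_1 are hypothetical placeholders, since the abstract gives only the counts.

```python
from itertools import product

# Placeholder labels only: the abstract states the counts, not the names.
METHODS = [f"method_{i}" for i in range(1, 11)]       # ten reformulation methods
LLM_FAMILIES = ["family_A", "family_B"]                # two architectural LLM families
LLM_SCALES = ["small", "large"]                        # two parameter scales
RETRIEVERS = ["lexical", "learned_sparse", "dense"]    # three retrieval paradigms
DATASETS = [f"dataset_{i}" for i in range(1, 10)]      # nine benchmark datasets


def build_grid():
    """Return one config dict per cell, assuming a full factorial design."""
    return [
        {
            "method": method,
            "llm_family": family,
            "llm_scale": scale,
            "retriever": retriever,
            "dataset": dataset,
        }
        for method, family, scale, retriever, dataset in product(
            METHODS, LLM_FAMILIES, LLM_SCALES, RETRIEVERS, DATASETS
        )
    ]


grid = build_grid()
print(len(grid))  # 10 * 2 * 2 * 3 * 9 = 1080 configurations
```

Under this assumption the grid has 1080 cells, not counting the unreformulated baseline runs per retriever and dataset.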
