A Reproducibility and Generalizability Study of Large Language Models for Query Generation
Moritz Staudinger, Wojciech Kusa, Florina Piroi, Aldo Lipani, and Allan Hanbury

TL;DR
This study evaluates how reproducible and generalizable large language models, such as ChatGPT and open-source alternatives, are when generating Boolean queries for systematic literature reviews, and reports their strengths and limitations.
Contribution
It provides a comprehensive analysis of LLMs for query generation, comparing multiple models and assessing their reliability and effectiveness in automating literature review tasks.
Findings
ChatGPT results are reproducible and consistent
Open-source models show comparable performance
LLM-based query generation has limitations and areas for improvement
Abstract
Systematic literature reviews (SLRs) are a cornerstone of academic research, yet they are often labour-intensive and time-consuming due to the detailed literature curation process. The advent of generative AI and large language models (LLMs) promises to revolutionize this process by assisting researchers in several tedious tasks, one of them being the generation of effective Boolean queries that will select the publications to consider including in a review. This paper presents an extensive study of Boolean query generation using LLMs for systematic reviews, reproducing and extending the work of Wang et al. and Alaniz et al. Our study investigates the replicability and reliability of results achieved using ChatGPT and compares its performance with open-source alternatives like Mistral and Zephyr to provide a more comprehensive analysis of LLMs for query generation. Therefore, we…
