Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies
Florian Angermeir, Maximilian Amougou, Mark Kreitz, Andreas Bauer, Matthias Linhuber, Davide Fucci, Fabiola Moyón C., Daniel Mendez, Tony Gorschek

TL;DR
This paper investigates the reproducibility of LLM-centric empirical studies in software engineering, revealing significant challenges and the need for better research practices to ensure reproducible results.
Contribution
It provides an analysis of recent LLM studies' reproducibility, identifies key impediments, and offers suggestions to improve research robustness and transparency.
Findings
Only 5 out of 18 studies were fully reproducible.
For none of those 5 studies could the results be reproduced completely.
Reproducibility issues stem from incomplete artefacts and study design weaknesses.
Abstract
Large Language Models (LLMs) have gained remarkable interest in industry and academia. In academia, this growing interest is reflected in the number of publications on the topic in recent years. For instance, 78 of the roughly 425 publications at ICSE 2024 alone performed experiments with LLMs. Conducting empirical studies with LLMs remains challenging and raises questions about how to achieve reproducible results, for both researchers and practitioners. One important step towards excelling in empirical research on LLMs and their application is to first understand to what extent current research results are actually reproducible and which factors may impede reproducibility. This investigation is within the scope of our work. We contribute an analysis of the reproducibility of LLM-centric studies, provide insights into the factors impeding reproducibility, and discuss suggestions…
