Towards Evaluation Guidelines for Empirical Studies involving LLMs
Stefan Wagner, Marvin Muñoz Barón, Davide Falessi, and Sebastian Baltes

TL;DR
This paper introduces the first comprehensive guidelines for conducting and evaluating empirical studies involving large language models in software engineering, aiming to standardize research quality and foster community discussion.
Contribution
It provides the initial set of holistic guidelines specifically tailored for empirical research involving LLMs in software engineering.
Findings
First set of guidelines for LLM-based empirical studies
Aims to improve research rigor and comparability
Encourages community discussion on standards
Abstract
In the short period since the release of ChatGPT, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process or studies that evaluate existing or new tools that are based on LLMs. This paper contributes the first set of holistic guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of our standards for high-quality empirical studies involving LLMs.
Taxonomy
Topics: Artificial Intelligence in Law · Digital Rights Management and Security · Law, AI, and Intellectual Property
