Towards Evaluation Guidelines for Empirical Studies involving LLMs

Stefan Wagner; Marvin Muñoz Barón; Davide Falessi; Sebastian Baltes

arXiv:2411.07668 · cs.SE · February 5, 2025 · 3 cites

Open Access

TL;DR

This paper introduces the first comprehensive guidelines for conducting and evaluating empirical studies involving large language models in software engineering, aiming to standardize research quality and foster community discussion.

Contribution

It provides the initial set of holistic guidelines specifically tailored for empirical research involving LLMs in software engineering.

Findings

01 First set of guidelines for LLM-based empirical studies
02 Aims to improve research rigor and comparability
03 Encourages community discussion on standards

Abstract

In the short period since the release of ChatGPT, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process or studies that evaluate existing or new tools that are based on LLMs. This paper contributes the first set of holistic guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of our standards for high-quality empirical studies involving LLMs.

Taxonomy

Topics: Artificial Intelligence in Law · Digital Rights Management and Security · Law, AI, and Intellectual Property