Comparing the Results of Replications in Software Engineering
Adrian Santos, Sira Vegas, Markku Oivo, Natalia Juristo

TL;DR
This paper investigates how to effectively compare and interpret replication results in software engineering, emphasizing the use of meta-analysis over simple p-value comparisons to understand discrepancies and contextual influences.
Contribution
It introduces simulation-based methods to evaluate replication similarity and advocates meta-analysis, rather than direct p-value comparison, for assessing replication results in software engineering (SE).
Findings
Direct comparison of p-values and effect sizes is inadequate for judging whether replication results agree.
Meta-analysis effectively assesses replication similarity.
Baseline experiment results should be integrated into a larger evidence context.
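The findings above can be illustrated with a small simulation. This is a minimal sketch, not the authors' simulation design: it assumes two-group experiments with normally distributed outcomes, a normal-approximation z-test, and a fixed-effect (inverse-variance) pooled estimate. The function names and parameters (`run_experiment`, `agreement_rate`, `fixed_effect_pool`, `true_effect`, `n_per_group`) are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def run_experiment(rng, true_effect, n_per_group):
    """Simulate one two-group experiment; return (mean difference, standard error)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treatment = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    diff = statistics.fmean(treatment) - statistics.fmean(control)
    se = math.sqrt(statistics.variance(treatment) / n_per_group
                   + statistics.variance(control) / n_per_group)
    return diff, se


def is_significant(diff, se, z_crit=1.96):
    """Two-sided z-test at roughly the 5% level (assumed normal approximation)."""
    return abs(diff / se) > z_crit


def agreement_rate(true_effect, n_per_group, n_pairs, seed=0):
    """Share of baseline/replication pairs whose significance verdicts match,
    even though both experiments sample the same true effect."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_pairs):
        base = run_experiment(rng, true_effect, n_per_group)
        repl = run_experiment(rng, true_effect, n_per_group)
        matches += is_significant(*base) == is_significant(*repl)
    return matches / n_pairs


def fixed_effect_pool(results):
    """Inverse-variance weighted pooled estimate and its standard error.
    `results` is a list of (effect estimate, standard error) pairs."""
    weights = [1.0 / se ** 2 for _, se in results]
    pooled = sum(w * d for w, (d, _) in zip(weights, results)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


if __name__ == "__main__":
    # Small samples and a moderate effect: verdicts often disagree by chance alone.
    print("significance agreement:", agreement_rate(0.5, 20, 2000))

    # Pooling several replications gives one estimate with a smaller standard error.
    rng = random.Random(1)
    replications = [run_experiment(rng, 0.5, 20) for _ in range(5)]
    print("pooled effect, SE:", fixed_effect_pool(replications))
```

Under these assumptions, a baseline experiment and its replication can reach different significance verdicts purely through sampling variation, with no contextual variable involved. Pooling the estimates, in contrast, places each experiment within the combined evidence instead of treating one result as the reference.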
Abstract
Context: It has been argued that software engineering replications are useful for verifying the results of previous experiments. However, it has not yet been agreed how to check whether the results hold across replications. Besides, some authors suggest that replications that do not verify the results of previous experiments can be used to identify contextual variables causing the discrepancies. Objective: Study how to assess the (dis)similarity of the results of SE replications when they are compared to verify the results of previous experiments and understand how to identify whether contextual variables are influencing results. Method: We run simulations to learn how different ways of comparing replication results behave when verifying the results of previous experiments. We illustrate how to deal with context-induced changes. To do this, we analyze three groups of replications from…
