On Reliability of Patch Correctness Assessment
Xuan Bach D. Le, Lingfeng Bao, David Lo, Xin Xia, Shanping Li

TL;DR
This paper evaluates the reliability of automated and author-based patch correctness annotations in automatic software repair by constructing a high-quality gold standard through a developer study.
Contribution
It introduces a systematic approach to assess annotation reliability using a gold set created via a user study with professional developers.
Findings
Automated annotation shows limited agreement with the gold set.
Author annotation has subjectivity issues affecting reliability.
The constructed gold set is comparable to other high-quality gold standards.
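As an illustration of what "agreement with the gold set" can mean, here is a minimal sketch that computes Cohen's kappa between two label lists. The choice of Cohen's kappa is an assumption made for illustration (this page does not say which agreement statistic the paper uses), and the function name `cohen_kappa` and the labels are hypothetical, not data from the study.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equally long label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of patches on which both annotations agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotation's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use a single identical label
    return (observed - expected) / (1 - expected)

# Made-up labels for four patches (illustration only).
gold = ["correct", "incorrect", "incorrect", "correct"]
automated = ["correct", "correct", "incorrect", "correct"]
kappa = cohen_kappa(gold, automated)
```

A value near 1 indicates strong agreement with the gold labels, while a value near 0 indicates agreement no better than chance.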
Abstract
Current state-of-the-art automatic software repair (ASR) techniques rely heavily on incomplete specifications, e.g., test suites, to generate repairs. This, however, may cause ASR tools to generate incorrect repairs that do not generalize. To assess patch correctness, researchers have typically followed one of two approaches: (1) automated annotation, wherein patches are automatically labeled by an independent test suite (ITS): a patch passing the ITS is regarded as correct or generalizable, and incorrect otherwise; and (2) author annotation, wherein authors of ASR techniques themselves annotate the correctness of patches generated by their own and competing tools. While automated annotation fails to prove that a patch is actually correct, author annotation is prone to subjectivity. This concern has caused an on-going debate on appropriate ways to assess the effectiveness of numerous…
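The automated annotation rule described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's tooling: the function name `annotate_with_its`, the representation of a patch as a Python callable, and the absolute-value example are all hypothetical.

```python
def annotate_with_its(patched_program, its):
    """Label a patch 'correct' if the patched program passes every test
    in the independent test suite (ITS), and 'incorrect' otherwise."""
    return "correct" if all(test(patched_program) for test in its) else "incorrect"

# Hypothetical ITS for a function that should compute the absolute value.
its = [
    lambda f: f(3) == 3,
    lambda f: f(-3) == 3,
    lambda f: f(0) == 0,
]

# A patch that generalizes, and one that overfits to a single failing input.
general_patch = lambda x: -x if x < 0 else x
overfitting_patch = lambda x: x if x > 0 else 3

general_label = annotate_with_its(general_patch, its)
overfitting_label = annotate_with_its(overfitting_patch, its)
```

As the abstract notes, a "correct" label here only means the patch passed the ITS; a patch can pass every test in a finite suite and still be wrong on inputs the suite does not cover.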
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research
