On the need to perform comprehensive evaluations of automated program repair benchmarks: Sorald case study

Sumudu Liyanage; Sherlock A. Licorish; Markus Wagner; Stephen G. MacDonell

arXiv:2508.15135·cs.SE·August 22, 2025

Open Access

TL;DR

This paper highlights the importance of comprehensive evaluation of automated program repair tools, demonstrating through a case study that current assessments overlook critical side effects like new faults and code degradation.

Contribution

It introduces a framework for evaluating APR tools holistically and applies it to Sorald, revealing significant side effects not captured by traditional metrics.

Findings

01. Sorald fixed some violations but introduced 2,120 new faults.

02. Unit test failure rate increased by 24% after repairs.

03. Code structure was degraded, emphasizing the need for comprehensive evaluation.
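The findings above compare a codebase before and after repair along several dimensions at once. A minimal sketch of that kind of side-by-side comparison follows. It is not the paper's framework: the `Snapshot` class, the `evaluate_repair` function, and the rule keys in the example are illustrative assumptions, and the sketch covers only violation counts and unit test outcomes, not code structure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical summary of one codebase state (before or after repair)."""
    rule_counts: Counter  # violations per static-analysis rule key
    tests_run: int
    tests_failed: int

    @property
    def failure_rate(self) -> float:
        # Guard against an empty test suite.
        return self.tests_failed / self.tests_run if self.tests_run else 0.0


def evaluate_repair(before: Snapshot, after: Snapshot, target_rule: str) -> dict:
    """Report what a repair fixed alongside what it may have broken."""
    # Violations of the targeted rule that disappeared.
    fixed = max(before.rule_counts[target_rule] - after.rule_counts[target_rule], 0)
    # Violations of any rule that are more numerous after the repair.
    introduced = sum(
        max(after.rule_counts[rule] - before.rule_counts[rule], 0)
        for rule in after.rule_counts
    )
    return {
        "fixed_target": fixed,
        "introduced": introduced,
        "failure_rate_delta": after.failure_rate - before.failure_rate,
    }


# Made-up numbers for illustration only; the rule keys are placeholders.
before = Snapshot(Counter({"S1118": 3, "S2095": 1}), tests_run=10, tests_failed=1)
after = Snapshot(Counter({"S2095": 1, "S1186": 2}), tests_run=10, tests_failed=3)
report = evaluate_repair(before, after, target_rule="S1118")
```

Comparing per-rule counts rather than exact violation locations is a deliberate simplification here, since a repair can shift line numbers and make location-based matching unreliable.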

Abstract

In supporting the development of high-quality software, especially necessary in the era of LLMs, automated program repair (APR) tools aim to improve code quality by automatically addressing violations detected by static analysis profilers. Previous research tends to evaluate APR tools only for their ability to clear violations, neglecting the possibility that they introduce new (sometimes severe) violations, change code functionality, or degrade code structure. There is thus a need for research to develop and assess comprehensive evaluation frameworks for APR tools. This study addresses this research gap and evaluates Sorald (a state-of-the-art APR tool) as a proof of concept. Sorald's effectiveness was evaluated in repairing 3,529 SonarQube violations across 30 rules within 2,393 Java code snippets extracted from Stack Overflow. Outcomes show that while Sorald fixes specific rule…

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Security and Verification in Computing