Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing
Thanh Le-Cong, Dat Nguyen, Bach Le, Toby Murray

TL;DR
This paper argues that the robustness of neural program repair (NPR) should be evaluated on naturally occurring semantic-preserving transformations, shows how such transformations affect repair performance, and proposes an LLM-based metric for assessing naturalness.
Contribution
It introduces a naturalness-focused robustness testing framework for NPR, including a human study on transformation naturalness and an LLM-based automatic assessment method.
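To make the idea of a semantic-preserving transformation concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's tooling: the `RenameVariable` class, the `transform` and `run_total` helpers, and the sample function are all hypothetical, and the choice of which new name counts as "natural" is an assumption made here for the example. Both rewrites keep the program's behavior identical; they differ only in how plausible the result looks to a developer, which is the property the human study assesses.

```python
import ast

class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier (hypothetical helper)."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        # Covers reads and writes of the variable.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        if node.arg == self.old:
            node.arg = self.new
        return node

ORIGINAL = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

def transform(source, old, new):
    """Return the source with `old` renamed to `new` (requires Python 3.9+)."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)

def run_total(source, values):
    """Execute the source and call its `total` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](values)

# Assumed labels: a reader would likely accept `acc`, but not `var_0x3f`.
natural = transform(ORIGINAL, "result", "acc")
unnatural = transform(ORIGINAL, "result", "var_0x3f")

# Both variants compute the same value as the original program.
assert run_total(ORIGINAL, [1, 2, 3]) == run_total(natural, [1, 2, 3])
assert run_total(ORIGINAL, [1, 2, 3]) == run_total(unnatural, [1, 2, 3])
```

A robustness benchmark built only from the second kind of rewrite would test a repair model on code that developers are unlikely to write, which is the evaluation bias the paper is concerned with.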
Findings
Human annotators judged only 60% of the transformations natural and 20% unnatural, with strong agreement among annotators.
NPR performance drops significantly on transformed datasets.
Different NPR techniques show varied robustness, indicating evaluation biases.
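One simple way to quantify such a drop is to compare the fix rate on the original bugs with the fix rate on their transformed counterparts. The sketch below assumes per-bug boolean outcomes (whether a correct patch was produced); the function names and the outcome lists are invented for illustration and are not results or metrics taken from the paper.

```python
def fix_rate(outcomes):
    """Fraction of bugs for which a correct patch was produced."""
    return sum(outcomes) / len(outcomes)

def robustness_drop(original, transformed):
    """Return the absolute and relative drop in fix rate."""
    before, after = fix_rate(original), fix_rate(transformed)
    relative = (before - after) / before if before else 0.0
    return before - after, relative

# Made-up outcomes for eight bugs, paired by position (illustrative only).
original_outcomes = [True, True, True, False, True, False, True, True]
transformed_outcomes = [True, False, True, False, True, False, False, True]

absolute, relative = robustness_drop(original_outcomes, transformed_outcomes)
print(f"absolute drop: {absolute:.2f}, relative drop: {relative:.2%}")
```

Computing the same two numbers per technique makes it possible to see whether a ranking of techniques on the original benchmark still holds once the inputs are transformed.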
Abstract
In this paper, we propose shifting the focus of robustness evaluation for Neural Program Repair (NPR) techniques toward naturally-occurring data transformations. To accomplish this, we first examine the naturalness of semantic-preserving transformations through a two-stage human study. This study includes (1) interviews with senior software developers to establish concrete criteria for evaluating the naturalness of these transformations, and (2) a survey involving 10 developers to assess the naturalness of 1,178 transformations, i.e., pairs of original and transformed programs, applied to 225 real-world bugs. Our findings show that only 60% of these transformations are deemed natural, while 20% are considered unnatural, with strong agreement among annotators. Moreover, the unnaturalness of these transformations significantly impacts both their applicability to benchmarks and the…
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Software System Performance and Reliability
