Automated Program Repair: Emerging trends pose and expose problems for benchmarks
Joseph Renzullo, Pemma Reiter, Westley Weimer, Stephanie Forrest

TL;DR
This paper discusses how emerging machine learning techniques, especially large language models, are transforming automated program repair and highlights the challenges in evaluating these new approaches using existing benchmarks.
Contribution
It identifies the mismatch between current APR benchmarks and ML-based methods, emphasizing the need for better evaluation practices for LLM-driven repair techniques.
Findings
Results on existing benchmarks may be biased because benchmark problems can overlap with LLM training data
ML-based APR methods are evolving rapidly and require new evaluation standards
Ensuring that results obtained with ML techniques are valid and generalizable remains a challenge
Abstract
Machine learning (ML) now pervades the field of Automated Program Repair (APR). Algorithms deploy neural machine translation and large language models (LLMs) to generate software patches, among other tasks. But there are important differences between these applications of ML and earlier work. Evaluations and comparisons must take care to ensure that results are valid and likely to generalize. A challenge is that the most popular APR evaluation benchmarks were not designed with ML techniques in mind. This is especially true for LLMs, whose large and often poorly disclosed training datasets may include problems on which they are evaluated.
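The overlap concern in the abstract can be made concrete with a small sketch. This is not a method from the paper; it is one simple check an evaluator might run, assuming each benchmark entry records the date its developer fix became public and that the model's training cutoff is known. The names `BenchmarkBug`, `fix_published`, and `split_by_cutoff`, along with the sample entries and the cutoff date, are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkBug:
    """One benchmark entry; both fields are hypothetical placeholders."""
    bug_id: str
    fix_published: date  # when the developer fix became publicly visible


def split_by_cutoff(bugs, training_cutoff):
    """Partition bugs into (possibly_seen, likely_unseen) relative to a cutoff.

    A fix published on or before the cutoff could have appeared in the
    training data. A fix published afterwards could not have, assuming the
    stated cutoff is accurate.
    """
    possibly_seen = [b for b in bugs if b.fix_published <= training_cutoff]
    likely_unseen = [b for b in bugs if b.fix_published > training_cutoff]
    return possibly_seen, likely_unseen


# Made-up entries and cutoff, purely for illustration.
bugs = [
    BenchmarkBug("example-1", date(2015, 3, 1)),
    BenchmarkBug("example-2", date(2023, 11, 20)),
]
seen, unseen = split_by_cutoff(bugs, training_cutoff=date(2021, 9, 1))
print([b.bug_id for b in seen])
print([b.bug_id for b in unseen])
```

Reporting repair rates separately for the two groups would at least show how much of a result rests on problems the model may have seen. A date filter is only a coarse proxy: it cannot detect leakage when the training data or cutoff is undisclosed, which is the situation the abstract warns about.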
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Radiation Effects in Electronics
