Unveiling Practical Shortcomings of Patch Overfitting Detection Techniques
David Williams, Ioakim Avraam, Aldeida Aleti, Matias Martinez, Justyna Petke, Federica Sarro

TL;DR
This study benchmarks patch overfitting detection techniques in realistic scenarios and finds that simple random sampling often outperforms state-of-the-art methods, which points to the need for more effective solutions.
Contribution
The paper presents the first comprehensive benchmarking study of patch overfitting detection (POD) methods on realistic datasets, showing that their practical effectiveness is limited compared with random sampling.
Findings
Random sampling outperforms POD tools in most cases
Current POD techniques have limited practical benefit
Benchmarking should use realistic data and baselines
Abstract
Automated Program Repair (APR) can reduce the time developers spend debugging, allowing them to focus on other aspects of software development. Automatically generated bug patches are typically validated through software testing. However, this method can lead to patch overfitting, i.e., generating patches that pass the given tests but are still incorrect. Patch correctness assessment (also known as overfitting detection) techniques have been proposed to identify patches that overfit. However, prior work often assessed the effectiveness of these techniques in isolation and on datasets that do not reflect the distribution of correct-to-overfitting patches that would be generated by APR tools in typical use; thus, we still do not know their effectiveness in practice. This work presents the first comprehensive benchmarking study of several patch overfitting detection (POD) methods in a…
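To make the notion of patch overfitting concrete, here is a minimal sketch in Python. It is an illustration only and is not taken from the paper: the functions `buggy_abs`, `overfitting_patch`, and `correct_patch`, the test list `GIVEN_TESTS`, and the helper `passes` are all hypothetical names invented for this example. It shows a patch that passes every given test yet remains incorrect on inputs the tests do not cover.

```python
# Hypothetical example: an absolute-value function with a bug,
# plus two candidate patches validated against a small test suite.

def buggy_abs(x):
    # Bug: negative inputs are returned unchanged.
    return x

def overfitting_patch(x):
    # Special-cases the single failing test input instead of fixing the logic.
    if x == -3:
        return 3
    return x

def correct_patch(x):
    # Fixes the underlying logic for all negative inputs.
    return -x if x < 0 else x

# Hypothetical test suite used to validate candidate patches: (input, expected).
GIVEN_TESTS = [(0, 0), (5, 5), (-3, 3)]

def passes(candidate, tests):
    """Return True if the candidate produces the expected output on every test."""
    return all(candidate(arg) == expected for arg, expected in tests)

if __name__ == "__main__":
    for name, fn in [("buggy", buggy_abs),
                     ("overfitting", overfitting_patch),
                     ("correct", correct_patch)]:
        print(name, "passes given tests:", passes(fn, GIVEN_TESTS),
              "| result for -7:", fn(-7))
```

Both patches pass the given tests, so test-based validation alone cannot tell them apart; only an input outside the suite (here, -7) exposes the overfitting patch. Telling such patches apart is the task that overfitting detection techniques aim to address.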
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research
