Unveiling Practical Shortcomings of Patch Overfitting Detection Techniques

David Williams; Ioakim Avraam; Aldeida Aleti; Matias Martinez; Justyna Petke; Federica Sarro

arXiv:2603.11262·cs.SE·March 13, 2026

TL;DR

This study benchmarks patch overfitting detection (POD) techniques in realistic scenarios and finds that simple random sampling often outperforms state-of-the-art methods, which points to the need for more effective solutions.

Contribution

The study provides the first comprehensive benchmark of POD methods on realistic datasets and shows that their practical effectiveness is limited compared to random sampling.

Findings

01 Random sampling outperforms POD tools in most cases

02 Current POD techniques have limited practical benefit

03 Benchmarking should use realistic data and baselines
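
To make the random sampling baseline in these findings concrete, the sketch below computes how often a uniformly random selection of patches contains at least one correct patch. It is an illustration under stated assumptions, not the paper's evaluation protocol: the function names `p_hit_random` and `simulate`, the pool size, the number of correct patches, and the selection budget `k` are all hypothetical, and no figure here comes from the study.

```python
import random
from math import comb


def p_hit_random(n_patches: int, n_correct: int, k: int) -> float:
    """Chance that k patches drawn uniformly without replacement
    include at least one correct patch (hypergeometric complement)."""
    if not 0 <= n_correct <= n_patches:
        raise ValueError("n_correct must lie in [0, n_patches]")
    if not 0 <= k <= n_patches:
        raise ValueError("k must lie in [0, n_patches]")
    # P(no correct patch among the k) = C(N - c, k) / C(N, k)
    return 1.0 - comb(n_patches - n_correct, k) / comb(n_patches, k)


def simulate(n_patches: int, n_correct: int, k: int,
             trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability, as a sanity check."""
    rng = random.Random(seed)
    # True marks a correct patch, False an overfitting one (hypothetical labels).
    labels = [True] * n_correct + [False] * (n_patches - n_correct)
    hits = sum(any(rng.sample(labels, k)) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # Hypothetical pool: 10 plausible patches, 2 of them correct, 3 inspected.
    print(f"closed form: {p_hit_random(10, 2, 3):.3f}")
    print(f"simulated:   {simulate(10, 2, 3):.3f}")
```

Under this framing, a POD tool adds value for a given pool only if the patches it ranks or retains reach a correct patch more often than this baseline does at the same budget `k`, which is why the ratio of correct to overfitting patches in the benchmark data matters.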

Abstract

Automated Program Repair (APR) can reduce the time developers spend debugging, allowing them to focus on other aspects of software development. Automatically generated bug patches are typically validated through software testing. However, this method can lead to patch overfitting, i.e., generating patches that pass the given tests but are still incorrect. Patch correctness assessment (also known as overfitting detection) techniques have been proposed to identify patches that overfit. However, prior work often assessed the effectiveness of these techniques in isolation and on datasets that do not reflect the distribution of correct-to-overfitting patches that would be generated by APR tools in typical use; thus, we still do not know their effectiveness in practice. This work presents the first comprehensive benchmarking study of several patch overfitting detection (POD) methods in a…

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research