Historian: Reducing Manual Validation in APR Benchmarking via Evidence-Based Assessment
Sahand Moslemi, Mayasah Lami, Anil Koyuncu

TL;DR
Historian uses Large Language Models to automate and improve the correctness assessment of patches in Automated Program Repair, reducing manual effort and increasing reliability by leveraging historical validation data.
Contribution
Introduces Historian, a novel framework that employs LLMs for evidence-based, multi-reference patch assessment, addressing limitations of existing methods.
Findings
Achieves 95% coverage with 88.4% accuracy in patch validation.
Reduces manual validation to 5% of patches.
Enhances existing APCA tools by up to 21.8% in accuracy.
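The coverage and manual-validation figures above are complementary: if 95% of patches receive an automatic verdict, the remaining 5% are left for manual review. The sketch below illustrates that bookkeeping on toy data. It assumes that coverage is the share of patches with an automatic verdict and that accuracy is measured only over those patches; the paper's exact definitions are not given in this summary, and the function name and data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def coverage_and_accuracy(
    verdicts: Sequence[Optional[bool]], truth: Sequence[bool]
) -> Tuple[float, float]:
    """Return (coverage, accuracy over decided patches).

    A verdict of None means the tool abstained and the patch is
    deferred to manual validation (an assumption of this sketch).
    """
    if len(verdicts) != len(truth):
        raise ValueError("verdicts and truth must have the same length")
    decided = [(v, t) for v, t in zip(verdicts, truth) if v is not None]
    coverage = len(decided) / len(verdicts) if verdicts else 0.0
    accuracy = sum(v == t for v, t in decided) / len(decided) if decided else 0.0
    return coverage, accuracy


# Toy data: 20 patches, 19 decided automatically, 1 deferred.
verdicts = [True] * 10 + [False] * 9 + [None]
truth = [True] * 9 + [False] * 10 + [True]

coverage, accuracy = coverage_and_accuracy(verdicts, truth)
manual_share = 1.0 - coverage  # fraction left for manual validation
print(f"coverage={coverage:.2%} accuracy={accuracy:.2%} manual={manual_share:.2%}")
```

In this toy run, one of the 19 decided verdicts disagrees with the ground truth, so accuracy is 18/19 while coverage is 19/20.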
Abstract
Assessing the correctness of patches generated by Automated Program Repair (APR) is a major bottleneck. Manual validation is labor-intensive and limited: exact matching overlooks valid variants, while semantic inspection is subjective and hard to reproduce. Existing Automated Patch Correctness Assessment (APCA) often relies on opaque predictive models that treat each patch as novel, repeatedly re-assessing semantically redundant patches. Our analysis of a large corpus of tool-generated patches reveals a duality: about 39% of unique correct patches are syntactic clones, suggesting opportunities for automation, yet about 65% of bugs have multiple distinct correct fixes, making single-reference assessment insufficient. We present Historian, a framework that leverages Large Language Models to perform multi-reference comparisons against a knowledge base of historically validated patches,…
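The idea of comparing a candidate against several historically validated patches for the same bug can be illustrated with a much simpler stand-in. The sketch below is not Historian's LLM-based comparison: it only detects syntactic clones by whitespace-normalized exact matching and abstains otherwise. The knowledge-base layout, bug identifiers, and function names are all hypothetical.

```python
from typing import Dict, Optional


def normalize(patch: str) -> str:
    """Collapse whitespace so formatting-only differences are ignored."""
    return " ".join(patch.split())


def assess(
    bug_id: str,
    candidate: str,
    knowledge_base: Dict[str, Dict[str, bool]],
) -> Optional[bool]:
    """Reuse a historical verdict if the candidate is a syntactic clone
    of any validated patch for the same bug; otherwise return None
    (no evidence, so the patch would need another form of assessment).
    """
    references = knowledge_base.get(bug_id, {})
    lookup = {normalize(p): verdict for p, verdict in references.items()}
    return lookup.get(normalize(candidate))


# Hypothetical knowledge base: several validated patches per bug,
# including more than one distinct correct fix.
kb = {
    "bug-1": {
        "if (x != null) { x.close(); }": True,
        "if (x == null) return; x.close();": True,
        "x.close();": False,
    }
}

print(assess("bug-1", "if (x != null) {\n    x.close();\n}", kb))
print(assess("bug-1", "try { x.close(); } catch (Exception e) {}", kb))
```

Keeping multiple references per bug matters here because a candidate that matches the second correct fix would be missed by a single-reference check against the first.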
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research
