Historian: Reducing Manual Validation in APR Benchmarking via Evidence-Based Assessment

Sahand Moslemi; Mayasah Lami; Anil Koyuncu

arXiv:2603.00649 · cs.SE · March 3, 2026

Open Access

TL;DR

Historian uses Large Language Models to automate and improve the correctness assessment of patches in Automated Program Repair, reducing manual effort and increasing reliability by leveraging historical validation data.

Contribution

Introduces Historian, a novel framework that employs LLMs for evidence-based, multi-reference patch assessment, addressing limitations of existing methods.

Findings

01. Achieves 95% coverage with 88.4% accuracy in patch validation.

02. Reduces manual validation to 5% of patches.

03. Enhances existing APCA tools by up to 21.8% in accuracy.
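
The coverage and manual-validation figures above are two views of the same quantity: patches that receive no automatic verdict are the ones left for human review. The sketch below shows that arithmetic under assumed definitions that this page does not spell out: coverage is the share of patches given an automatic verdict, and accuracy is measured only over those covered patches. The function name and the label strings are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def coverage_and_accuracy(
    verdicts: Sequence[Optional[str]], truth: Sequence[str]
) -> Tuple[float, float]:
    """Return (coverage, accuracy) for a batch of patch verdicts.

    Assumed definitions (not taken from the paper):
      coverage = share of patches that received an automatic verdict
      accuracy = share of those verdicts that match the ground truth
    A verdict of None means the patch was deferred to manual validation.
    """
    if len(verdicts) != len(truth):
        raise ValueError("verdicts and truth must have the same length")
    if not verdicts:
        raise ValueError("at least one patch is required")

    # Keep only the patches that were decided automatically.
    decided = [(v, t) for v, t in zip(verdicts, truth) if v is not None]
    coverage = len(decided) / len(verdicts)
    accuracy = sum(v == t for v, t in decided) / len(decided) if decided else 0.0
    return coverage, accuracy
```

Under these definitions the manual share is simply `1 - coverage`, so 95% coverage leaves 5% of patches for human review.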

Abstract

Assessing the correctness of patches generated by Automated Program Repair (APR) is a major bottleneck. Manual validation is labor-intensive and limited: exact matching overlooks valid variants, while semantic inspection is subjective and hard to reproduce. Existing Automated Patch Correctness Assessment (APCA) often relies on opaque predictive models that treat each patch as novel, repeatedly re-assessing semantically redundant patches. Our analysis of a large corpus of tool-generated patches reveals a duality: about 39% of unique correct patches are syntactic clones, suggesting opportunities for automation, yet about 65% of bugs have multiple distinct correct fixes, making single-reference assessment insufficient. We present Historian, a framework that leverages Large Language Models to perform multi-reference comparisons against a knowledge base of historically validated patches,…
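
To make the multi-reference idea concrete, here is a minimal sketch of a knowledge base that stores several validated patches per bug and resolves a new candidate when it is a syntactic clone of any of them. It is not the Historian implementation: the LLM-based comparison described in the abstract is replaced by a plain normalized-text match, and `normalize`, `KnowledgeBase`, the label strings, and the bug identifiers are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict


def normalize(patch: str) -> str:
    """Strip `//` line comments and collapse whitespace.

    Simplification for the sketch: a `//` inside a string literal would
    also be stripped, which a real tokenizer would avoid.
    """
    without_comments = re.sub(r"//[^\n]*", "", patch)
    return " ".join(without_comments.split())


@dataclass
class KnowledgeBase:
    # bug id -> {normalized patch text: label from an earlier validation}
    entries: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add(self, bug_id: str, patch: str, label: str) -> None:
        """Record a historically validated patch for a bug."""
        self.entries.setdefault(bug_id, {})[normalize(patch)] = label

    def assess(self, bug_id: str, patch: str) -> str:
        """Return the stored label if the candidate is a clone of any
        reference for this bug, otherwise 'unknown' (defer to a human or
        to a stronger comparison)."""
        return self.entries.get(bug_id, {}).get(normalize(patch), "unknown")
```

Keeping several references per bug is what lets a candidate match the second or third known-correct fix rather than only the developer's patch. A candidate that matches nothing is reported as `unknown` rather than guessed at.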

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research