Towards a Human-in-the-Loop Framework for Reliable Patch Evaluation Using an LLM-as-a-Judge
Sherry Shi, Renyao Wei, Michele Tufano, José Cambronero, Runxiang Cheng, Franjo Ivančić, Pat Rondon

TL;DR
This paper proposes a human-in-the-loop framework using LLMs to reliably evaluate patch validity in automated program repair, reducing manual effort while maintaining high agreement with human judgments.
Contribution
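It introduces an LLM-based patch evaluation method in which an LLM drafts a per-bug rubric, a human reviews and optionally refines that rubric once, and an LLM then judges each patch against it, improving the reliability and consistency of patch validity assessment.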
Findings
Achieves Cohen's kappa of 0.75 with human consensus
High recall of 0.94 in patch validity detection
Precision of 0.80 when patches have unanimous human agreement
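The metrics listed above can all be derived from paired binary validity labels (human consensus versus LLM judge). The following is a minimal sketch, assuming labels are encoded as booleans with True meaning "valid"; the function names and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence, Tuple


def confusion_counts(human: Sequence[bool], judge: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating the human label as ground truth."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    tp = fp = fn = tn = 0
    for h, j in zip(human, judge):
        if h and j:
            tp += 1
        elif j:        # judge says valid, human says invalid
            fp += 1
        elif h:        # judge says invalid, human says valid
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(human: Sequence[bool], judge: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of the judge's 'valid' verdicts (0.0 if undefined)."""
    tp, fp, fn, _ = confusion_counts(human, judge)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two binary raters."""
    tp, fp, fn, tn = confusion_counts(human, judge)
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("no labels given")
    p_o = (tp + tn) / n                                # observed agreement
    p_valid = ((tp + fn) / n) * ((tp + fp) / n)        # chance agreement on 'valid'
    p_invalid = ((fp + tn) / n) * ((fn + tn) / n)      # chance agreement on 'invalid'
    p_e = p_valid + p_invalid
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels for illustration only (not data from the paper).
human_labels = [True, True, True, False, False, False, True, False]
judge_labels = [True, True, False, False, False, True, True, False]
kappa = cohens_kappa(human_labels, judge_labels)
precision, recall = precision_recall(human_labels, judge_labels)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is reported alongside precision and recall rather than plain accuracy.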
Abstract
Reliable evaluation is crucial for advancing Automated Program Repair (APR), but prevailing benchmarks rely on execution-based evaluation methods (unit test pass@k), which fail to capture true patch validity. Determining validity can require costly manual annotation. To reduce this cost, we introduce a human-in-the-loop approach to LLM-based patch validity judgment. Inspired by the observation that human judgment is better aligned when using a shared rubric, we first employ an LLM to generate a per-bug rubric, followed by a one-time human review and optional refinement to this rubric, and then employ an LLM to judge patches using the refined rubric. We apply this approach to assign binary validity labels to patches for issues found by Google sanitizer tools. Our results show that this approach yields substantial agreement with human consensus (Cohen's kappa 0.75), high recall (0.94) and…
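The three-step workflow described in the abstract (LLM drafts a per-bug rubric, a human reviews it once, an LLM judges patches against the refined rubric) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the `LLM` callable type, the `Bug` record, the prompt wording, and the VALID/INVALID answer format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface: any function that maps a prompt to a text response.
LLM = Callable[[str], str]
# Hypothetical interface: a human reviewer who returns a (possibly edited) rubric.
Reviewer = Callable[[str], str]


@dataclass
class Bug:
    bug_id: str
    report: str  # e.g. a sanitizer finding plus relevant context


def build_rubric(bug: Bug, llm: LLM, review: Optional[Reviewer] = None) -> str:
    """Step 1: LLM drafts a per-bug rubric. Step 2: optional one-time human refinement."""
    draft = llm(f"Write a rubric for judging patches for this bug:\n{bug.report}")
    return review(draft) if review is not None else draft


def judge_patch(bug: Bug, rubric: str, patch: str, llm: LLM) -> bool:
    """Step 3: LLM assigns a binary validity label using the refined rubric."""
    verdict = llm(
        f"Bug:\n{bug.report}\n\nRubric:\n{rubric}\n\nPatch:\n{patch}\n\n"
        "Answer with VALID or INVALID."
    )
    # "INVALID" does not start with "VALID", so this check is unambiguous.
    return verdict.strip().upper().startswith("VALID")


def evaluate_patches(
    bug: Bug, patches: List[str], llm: LLM, review: Optional[Reviewer] = None
) -> List[bool]:
    """Build the rubric once per bug, then reuse it for every candidate patch."""
    rubric = build_rubric(bug, llm, review)
    return [judge_patch(bug, rubric, patch, llm) for patch in patches]
```

Building the rubric once per bug and reusing it across patches is what keeps the human cost to a single review per bug, regardless of how many candidate patches are judged.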
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research
