Towards a Human-in-the-Loop Framework for Reliable Patch Evaluation Using an LLM-as-a-Judge

Sherry Shi; Renyao Wei; Michele Tufano; José Cambronero; Runxiang Cheng; Franjo Ivančić; Pat Rondon

arXiv:2511.10865 · cs.SE · November 17, 2025

Open Access

TL;DR

This paper proposes a human-in-the-loop framework using LLMs to reliably evaluate patch validity in automated program repair, reducing manual effort while maintaining high agreement with human judgments.

Contribution

It introduces a novel LLM-based patch evaluation method with a shared rubric and human refinement, improving reliability and consistency in patch validity assessment.

Findings

01 Achieves Cohen's kappa of 0.75 with human consensus

02 High recall of 0.94 in patch validity detection

03 Precision of 0.80 when patches have unanimous human agreement
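
For readers unfamiliar with the metrics in these findings, the following is a minimal sketch of how Cohen's kappa, precision, and recall can be computed for binary validity labels. Kappa is the standard chance-corrected agreement, (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function names and the toy labels are illustrative assumptions, not the paper's evaluation code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # p_o: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected if each rater labeled independently
    # at their own observed label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall(human, judge, positive=True):
    """Precision and recall of the judge, treating human labels as truth."""
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (hypothetical): True means "patch judged valid".
human = [True, True, True, False, False, False, True, False]
judge = [True, True, True, True, False, False, False, False]
kappa = cohens_kappa(human, judge)
precision, recall = precision_recall(human, judge)
```

On the toy labels the raters agree on 6 of 8 items with a chance agreement of 0.5, so kappa is 0.5, and both precision and recall are 0.75.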

Abstract

Reliable evaluation is crucial for advancing Automated Program Repair (APR), but prevailing benchmarks rely on execution-based evaluation methods (unit test pass@k), which fail to capture true patch validity. Determining validity can require costly manual annotation. To reduce this cost, we introduce a human-in-the-loop approach to LLM-based patch validity judgment. Inspired by the observation that human judgment is better aligned when using a shared rubric, we first employ an LLM to generate a per-bug rubric, followed by a one-time human review and optional refinement to this rubric, and then employ an LLM to judge patches using the refined rubric. We apply this approach to assign binary validity labels to patches for issues found by Google sanitizer tools. Our results show that this approach yields substantial agreement with human consensus (Cohen's kappa 0.75), high recall (0.94) and…
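
The three-stage workflow described in the abstract (LLM-generated rubric, one-time human review, LLM judgment against the refined rubric) might be sketched as below. This is only an illustration under stated assumptions: the `LLM` callable type, the `Bug` record, the prompt wording, and every function name are hypothetical and do not reflect the authors' implementation or any particular model API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Bug:
    bug_id: str
    report: str  # e.g. a sanitizer report describing the issue


def generate_rubric(llm: LLM, bug: Bug) -> str:
    """Stage 1: ask the LLM to draft a per-bug rubric."""
    prompt = (
        "Write a rubric listing what a valid fix for this bug must do.\n"
        f"Bug report:\n{bug.report}"
    )
    return llm(prompt)


def refine_rubric(
    draft: str, human_edit: Optional[Callable[[str], str]] = None
) -> str:
    """Stage 2: one-time human review; refinement is optional."""
    return human_edit(draft) if human_edit else draft


def judge_patch(llm: LLM, rubric: str, patch: str) -> bool:
    """Stage 3: ask the LLM for a binary validity label for one patch."""
    prompt = (
        "Using the rubric, answer VALID or INVALID.\n"
        f"Rubric:\n{rubric}\n"
        f"Patch:\n{patch}"
    )
    # Assumed output convention: the reply begins with VALID or INVALID.
    return llm(prompt).strip().upper().startswith("VALID")


def evaluate_patches(
    llm: LLM,
    bug: Bug,
    patches: List[str],
    human_edit: Optional[Callable[[str], str]] = None,
) -> List[bool]:
    """Build the rubric once per bug, then reuse it for every patch."""
    rubric = refine_rubric(generate_rubric(llm, bug), human_edit)
    return [judge_patch(llm, rubric, patch) for patch in patches]
```

Passing the model as a plain callable keeps the sketch independent of any vendor SDK, and building the rubric once per bug mirrors the abstract's point that human review is a one-time cost amortized over all candidate patches for that bug.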

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research