An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks

Xin Zhou; Kisub Kim; Ting Zhang; Martin Weyssow; Luis F. Gomes; Guang Yang; Kui Liu; Xin Xia; David Lo

arXiv:2505.20854 · cs.SE · October 13, 2025

TL;DR

SE-Jury is a novel ensemble-based evaluation metric for software artifacts generated by LLMs, significantly improving correlation with human judgment across multiple SE tasks, thus offering a scalable alternative to manual assessment.

Contribution

This paper introduces SE-Jury, the first LLM-based ensemble evaluation metric tailored for assessing software artifact correctness, combining multiple strategies for improved accuracy.

Findings

01 SE-Jury outperforms existing metrics, achieving 29.6% to 140.8% higher correlation with human judgment.

02 It reaches agreement levels with human annotators that are close to inter-annotator agreement.

03 It is effective across code generation, repair, and summarization tasks.
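
The findings above are stated in terms of correlation with human judgment. As a rough illustration of what such a comparison involves, the sketch below computes a Spearman rank correlation between human ratings and two hypothetical metrics, then the relative gain of one over the other. Everything here is an assumption made for illustration: this summary does not say which correlation coefficient the paper uses or how the percentage improvements are computed, and the scores are made-up toy values, not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a percentage (assumed definition)."""
    return (new - baseline) / baseline * 100


# Toy data (hypothetical): human correctness ratings and two automatic metrics.
human_scores = [0, 1, 2, 3, 4]
metric_a = [0.1, 0.3, 0.2, 0.8, 0.9]  # stands in for a judge-style metric
metric_b = [0.5, 0.4, 0.6, 0.3, 0.7]  # stands in for a baseline metric

rho_a = spearman(human_scores, metric_a)
rho_b = spearman(human_scores, metric_b)
print(f"metric_a: {rho_a:.2f}, metric_b: {rho_b:.2f}")
print(f"relative gain: {relative_gain(rho_a, rho_b):.1f}%")
```

A rank-based coefficient is used here only because it does not require the metric and the human ratings to share a scale; other coefficients such as Pearson or Kendall would be computed over the same paired scores.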

Abstract

Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, many automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SE-Jury, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SE-Jury first defines five distinct evaluation strategies, each implemented by an independent judge. A dynamic…

Taxonomy

Topics: Business Law and Ethics · Corporate Governance and Law · Digitalization, Law, and Regulation

Methods: Sparse Evolutionary Training