Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering

Ruiqi Wang; Jiyu Guo; Cuiyun Gao; Guodong Fan; Chun Yong Chong; Xin Xia

arXiv:2502.06193 · cs.SE · April 22, 2025 · 2 cites

Open Access

TL;DR

This study empirically evaluates the effectiveness of LLMs as automated judges for software engineering tasks, finding that output-based LLM evaluation methods closely align with human judgments and outperform traditional metrics.

Contribution

It provides the first comprehensive empirical comparison of LLM-as-a-judge methods against human evaluation in SE tasks, demonstrating their high correlation and potential as automated evaluators.

Findings

01

Output-based LLM evaluation methods achieve a Pearson correlation of up to 81.32 with human scores (apparently reported on a 0-100 scale, i.e., r ≈ 0.81).

02

These methods significantly outperform traditional metrics such as ChrF++.

03

LLM-based evaluations produce more human-like and balanced score distributions.
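
As a rough illustration of how the alignment figure in finding 01 could be computed, the sketch below calculates a Pearson correlation between human scores and LLM-judge scores for the same set of outputs. The score lists are invented for illustration and are not data from the paper, the function name `pearson` is hypothetical, and the multiplication by 100 reflects an assumption that the reported value is the coefficient on a 0-100 scale.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 quality ratings for five generated outputs (not from the paper).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 4, 3, 5]

r = pearson(human_scores, judge_scores)
# Assumption: the paper-style figure is the coefficient multiplied by 100.
print(f"r = {r:.2f}, scaled = {100 * r:.2f}")
```

A higher coefficient means the judge's scores rise and fall with the human scores. It does not by itself show that the two assign the same absolute scores, which is why the score distributions in finding 03 are reported separately.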

Abstract

Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of this LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, their exact human alignment in SE tasks remains unexplored.…

Taxonomy

Topics: Artificial Intelligence in Law