Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering
Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, Xin Xia

TL;DR
This study empirically evaluates the effectiveness of LLMs as automated judges for software engineering tasks, finding that output-based LLM evaluation methods closely align with human judgments and outperform traditional metrics.
Contribution
It provides the first comprehensive empirical comparison of LLM-as-a-judge methods against human evaluation in SE tasks, demonstrating their high correlation and potential as automated evaluators.
Findings
Output-based LLM evaluation methods achieve a Pearson correlation of up to 81.32 with human scores (reported on a 0-100 scale, i.e. r ≈ 0.81).
These methods significantly outperform traditional metrics such as ChrF++.
LLM-based evaluations produce more human-like and balanced score distributions.
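To make the notion of agreement with human judgments concrete, here is a minimal sketch of how a Pearson correlation between judge scores and human scores can be computed. The function name `pearson` and the example score lists are hypothetical illustrations, not data or code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five responses (assumed 1-5 rating scale).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 2, 3, 5, 5]

r = pearson(human_scores, judge_scores)
print(f"Pearson r = {r:.4f} ({100 * r:.2f} on a 0-100 scale)")
```

A value near 1 means the judge ranks and spaces responses much as the human annotators do; multiplying by 100 gives the scale on which figures such as 81.32 are quoted.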
Abstract
Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of this LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, their exact human alignment in SE tasks remains unexplored.…
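For readers unfamiliar with Pass@k, the sketch below shows the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. This is an assumption about how the metric is typically computed; the abstract does not specify which estimator the paper uses, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes.

    n: total samples generated for a problem
    c: number of those samples that pass all unit tests
    k: sample budget being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    # 1 minus the fraction of k-subsets that contain only failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples, 5 of which pass the tests.
print(pass_at_k(10, 5, 1))
```

Every term in this estimate depends on executing unit tests for each sample, which is why the metric needs configured environments and does not apply to generated text.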
Taxonomy
Topics: Artificial Intelligence in Law
