LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead

Junda He; Jieke Shi; Terry Yue Zhuo; Christoph Treude; Jiamou Sun; Zhenchang Xing; Xiaoning Du; David Lo

arXiv:2510.24367 · cs.SE · October 29, 2025

TL;DR

This paper reviews the emerging use of Large Language Models (LLMs) as automated judges of software engineering outputs. It highlights current limitations and research gaps, and lays out a roadmap toward scalable, reliable evaluation methods by 2030.

Contribution

The paper provides a literature review of LLM-as-a-Judge in software engineering, identifies key research gaps, and outlines a roadmap for developing LLM-as-a-Judge frameworks.

Findings

01. LLM-as-a-Judge offers a promising scalable evaluation approach.

02. Current research is still in early stages with many limitations.

03. A future roadmap aims to develop robust, multi-faceted evaluation frameworks.

Abstract

The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks like code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods to evaluate these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics like BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm - using LLMs for automated evaluation - has emerged. This approach leverages the advanced reasoning of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE…
