LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead
Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, David Lo

TL;DR
This paper reviews the emerging use of Large Language Models as automated judges for evaluating software engineering outputs, highlighting current limitations, research gaps, and a future roadmap toward scalable, reliable evaluation methods by 2030.
Contribution
It provides a comprehensive literature review, identifies key research gaps, and outlines a detailed roadmap for developing LLM-as-a-Judge frameworks in software engineering.
Findings
LLM-as-a-Judge offers a promising scalable evaluation approach.
Current research is still in early stages with many limitations.
A future roadmap aims to develop robust, multi-faceted evaluation frameworks.
Abstract
The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks like code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods to evaluate these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics like BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm, which uses LLMs for automated evaluation, has emerged. This approach leverages the advanced reasoning of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE…
