On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization
Giuseppe Crupi, Rosalia Tufano, Alejandro Velasco, Antonio Mastropaolo, Denys Poshyvanyk, Gabriele Bavota

TL;DR
This paper evaluates the effectiveness of large language models as judges for code generation and summarization, finding GPT-4-turbo to be the most capable but still imperfect in assessing quality.
Contribution
It provides an empirical assessment of LLMs as judges for code-related tasks, highlighting their strengths and limitations compared to human judgment.
Findings
GPT-4-turbo outperforms smaller LLMs in judging code and summaries.
Smaller LLMs struggle with judging tasks.
Even the best LLMs often misjudge code correctness and summary quality.
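To make "misjudge code correctness" concrete, a judge's verdicts can be compared against a ground truth such as test outcomes. The sketch below is only an illustration, not the paper's evaluation pipeline: the function names, the binary correct/incorrect labels, and the choice of Cohen's kappa as the agreement statistic are all assumptions made for the example.

```python
from typing import Dict, Sequence


def cohen_kappa(judge: Sequence[bool], truth: Sequence[bool]) -> float:
    """Chance-corrected agreement between binary judge verdicts and ground truth."""
    if len(judge) != len(truth) or len(judge) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(judge)
    # Fraction of items on which the judge and the ground truth agree.
    observed = sum(j == t for j, t in zip(judge, truth)) / n
    # Agreement expected by chance, from each side's rate of "correct" labels.
    p_judge = sum(judge) / n
    p_truth = sum(truth) / n
    expected = p_judge * p_truth + (1 - p_judge) * (1 - p_truth)
    if expected == 1.0:
        # Both label sets are constant and identical; kappa is taken as 1 here.
        return 1.0
    return (observed - expected) / (1 - expected)


def misjudgments(judge: Sequence[bool], truth: Sequence[bool]) -> Dict[str, int]:
    """Count the two ways a judge can be wrong about code correctness."""
    pairs = list(zip(judge, truth))
    return {
        # Judge accepted code that is actually wrong.
        "false_accept": sum(j and not t for j, t in pairs),
        # Judge rejected code that is actually correct.
        "false_reject": sum(t and not j for j, t in pairs),
    }


# Hypothetical verdicts for six generated solutions (True = "correct").
judge_verdicts = [True, True, False, True, False, False]
test_outcomes = [True, False, False, True, True, False]

kappa = cohen_kappa(judge_verdicts, test_outcomes)
errors = misjudgments(judge_verdicts, test_outcomes)
```

Separating false accepts from false rejects matters in practice, because a judge that approves wrong code and one that rejects working code fail in different ways even when their overall agreement is the same.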
Abstract
Large Language Models have recently been exploited as judges for complex natural language processing tasks, such as Q&A. The basic idea is to delegate to an LLM the assessment of the "quality" of the output provided by an automated technique for tasks for which (i) quantitative metrics would only tell part of the story, and (ii) a large-scale human-based evaluation would be too expensive. LLMs-as-a-judge, if proven effective for a specific task, can also unlock new possibilities for automation, with several LLMs proposing a solution for a given instance of the task and others judging and deciding what is the best output to show the user. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization. The rationale for choosing these tasks is two-fold. First, quantitative metrics are usually not enough for the assessment of code…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Mathematics, Computing, and Information Processing
