Can Large Language Models Serve as Evaluators for Code Summarization?

Yang Wu; Yao Wan; Zhaoyang Chu; Wenting Zhao; Ye Liu; Hongyu Zhang; Xuanhua Shi; Philip S. Yu

arXiv:2412.01333·cs.SE·December 3, 2024

TL;DR

This paper investigates whether Large Language Models can serve as automatic evaluators for code summarization, and proposes a role-player prompting method that correlates more closely with human judgments than existing automatic metrics.

Contribution

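The paper introduces CODERPE (Role-Player for Code Summarization Evaluation), a role-player prompting approach that uses LLMs to evaluate code summaries and aligns more closely with human evaluations than traditional automatic metrics such as BLEU, ROUGE-L, METEOR, and BERTScore.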

Findings

01. CODERPE achieves an 81.59% Spearman correlation with human judgments.

02. CODERPE outperforms BERTScore by 17.27% in correlation with human judgments.

03. Role-based prompting enhances evaluation robustness.
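
The findings are stated in terms of Spearman correlation between an evaluator's scores and human ratings. As a minimal sketch of how such a figure is computed (the percentages reported here are presumably the coefficient multiplied by 100), the snippet below ranks both score lists and takes the Pearson correlation of the ranks. The function names and the sample numbers are illustrative assumptions, not data or code from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up example: human ratings vs. scores from some automatic evaluator.
human_ratings = [1, 2, 3, 4, 5]
evaluator_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
rho = spearman(human_ratings, evaluator_scores)
print(f"Spearman correlation: {rho * 100:.2f}%")
```

Because only ranks matter, the evaluator's scores can be on any scale; what is measured is whether it orders the summaries the way human raters do.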

Abstract

Code summarization facilitates program comprehension and software maintenance by converting code snippets into natural-language descriptions. Over the years, numerous methods have been developed for this task, but a key challenge remains: effectively evaluating the quality of generated summaries. While human evaluation is effective for assessing code summary quality, it is labor-intensive and difficult to scale. Commonly used automatic metrics, such as BLEU, ROUGE-L, METEOR, and BERTScore, often fail to align closely with human judgments. In this paper, we explore the potential of Large Language Models (LLMs) for evaluating code summarization. We propose CODERPE (Role-Player for Code Summarization Evaluation), a novel method that leverages role-player prompting to assess the quality of generated summaries. Specifically, we prompt an LLM agent to play diverse roles, such as code…
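
The role-player idea described in the abstract can be sketched roughly as follows: the same code and summary are shown to an LLM under several role instructions, and the per-role scores are combined. Everything concrete in this sketch is an assumption made for illustration: the role names are placeholders (the abstract's own list is truncated above), the prompt wording, the 1 to 5 scale, the averaging step, and the `ask_llm` callable are hypothetical and are not taken from CODERPE.

```python
import re
from statistics import mean
from typing import Callable, Sequence

# Hypothetical prompt wording; not the prompt used in the paper.
PROMPT_TEMPLATE = (
    "You are acting as a {role}.\n"
    "Rate how well the summary describes the code on a scale of 1 to {max_score}.\n"
    "Reply with a single integer.\n\n"
    "Code:\n{code}\n\nSummary:\n{summary}\n"
)


def build_prompt(role: str, code: str, summary: str, max_score: int = 5) -> str:
    """Fill the template for one role."""
    return PROMPT_TEMPLATE.format(
        role=role, code=code, summary=summary, max_score=max_score
    )


def parse_score(reply: str, max_score: int = 5) -> int:
    """Take the first integer in the reply and clamp it to [1, max_score]."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    return min(max(int(match.group()), 1), max_score)


def evaluate_summary(
    code: str,
    summary: str,
    roles: Sequence[str],
    ask_llm: Callable[[str], str],
    max_score: int = 5,
) -> float:
    """Query the model once per role and average the scores (assumed aggregation)."""
    scores = [
        parse_score(ask_llm(build_prompt(role, code, summary, max_score)), max_score)
        for role in roles
    ]
    return mean(scores)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "Score: 4" if "reviewer" in prompt else "Score: 2"


# Placeholder roles for illustration only.
example_roles = ["reviewer", "maintainer"]
example_score = evaluate_summary(
    "def add(a, b):\n    return a + b",
    "Adds two numbers.",
    example_roles,
    fake_llm,
)
print(example_score)
```

In practice `ask_llm` would wrap whichever model client is in use; passing it in as a callable keeps the prompt construction and score parsing testable without network access.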

Code & Models

Repositories

CGCL-codes/naturalcc (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research