DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation

Junyi Lu, Xiaojia Li, Zihan Hua, Lei Yu, Shiqi Cheng, Li Yang, Fengjun Zhang, and Chun Zuo

arXiv:2412.18291·cs.SE·January 28, 2025

TL;DR

This paper introduces DeepCRCEval, a new evaluation framework for code review comments that combines human and LLM assessments, addressing limitations of traditional text similarity metrics and improving evaluation reliability and efficiency.

Contribution

The paper proposes an evaluation framework that integrates human and LLM evaluators, and introduces LLM-Reviewer as a new, efficient baseline.

Findings

1. Fewer than 10% of benchmark comments are of high enough quality for automation.
2. DeepCRCEval effectively distinguishes between comment quality levels.
3. Incorporating LLM evaluators significantly reduces evaluation time and cost.

Abstract

Code review is a vital but demanding aspect of software development, generating significant interest in automating review comments. Traditional evaluation methods for these comments, primarily based on text similarity, face two major challenges: inconsistent reliability of human-authored comments in open-source projects and the weak correlation of text similarity with objectives like enhancing code quality and detecting defects. This study empirically analyzes benchmark comments using a novel set of criteria informed by prior research and developer interviews. We then similarly revisit the evaluation of existing methodologies. Our evaluation framework, DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a comprehensive reassessment of current techniques based on the criteria set. We also introduce an innovative and efficient baseline, LLM-Reviewer,…

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training