DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation
Junyi Lu, Xiaojia Li, Zihan Hua, Lei Yu, Shiqi Cheng, Li Yang, Fengjun Zhang, and Chun Zuo

TL;DR
This paper introduces DeepCRCEval, a new evaluation framework for code review comments that combines human and LLM assessments, addressing limitations of traditional text similarity metrics and improving evaluation reliability and efficiency.
Contribution
It proposes an innovative evaluation framework integrating human and LLM evaluators, and introduces LLM-Reviewer as a new baseline for comment quality assessment.
Findings
Fewer than 10% of benchmark comments are of high enough quality to be useful for automation.
DeepCRCEval effectively distinguishes comment quality levels.
Incorporating LLM evaluators reduces evaluation time and cost significantly.
Abstract
Code review is a vital but demanding aspect of software development, generating significant interest in automating review comments. Traditional evaluation methods for these comments, primarily based on text similarity, face two major challenges: inconsistent reliability of human-authored comments in open-source projects and the weak correlation of text similarity with objectives like enhancing code quality and detecting defects. This study empirically analyzes benchmark comments using a novel set of criteria informed by prior research and developer interviews. We then similarly revisit the evaluation of existing methodologies. Our evaluation framework, DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a comprehensive reassessment of current techniques based on the criteria set. Besides, we also introduce an innovative and efficient baseline, LLM-Reviewer,…
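The abstract's point about text similarity can be illustrated with a small sketch. The function below is a simple token-overlap F1, used here as a stand-in for similarity metrics in general; it is not the metric used by the benchmarks or by DeepCRCEval, and the three review comments are invented for illustration only.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two comments (illustrative stand-in metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both comments.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical comments, not taken from any benchmark.
reference = "nit: rename this"
vague = "nit: rename this variable"
useful = "possible null dereference when user is missing; add a guard before calling save"

print("vague  vs reference:", round(token_f1(vague, reference), 3))
print("useful vs reference:", round(token_f1(useful, reference), 3))
```

Under these assumptions, the generated comment that merely echoes a low-value reference scores far higher than the one that flags a potential defect, which is the kind of mismatch that motivates criteria-based evaluation.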
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
