Automated Classification of Human Code Review Comments with Large Language Models
Semih Çağlar, Şükrü Eren Gökırmak, Eray Tüzün

TL;DR
This paper develops and evaluates an automated system that uses large language models to classify code review comments by the kind of issue they exhibit, such as redundancy or vagueness.
Contribution
It introduces a nine-label taxonomy for classifying review comments (six review comment smells and three common useful intents) and benchmarks zero-shot and one-shot LLM performance on this task.
Findings
Zero-shot macro-F1 scores ranged from 0.360 to 0.374.
One-shot exemplars improved performance for some models and labels.
Classification of evidence-sensitive labels remains challenging.
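For readers unfamiliar with the metric behind these scores, macro-F1 is the unweighted mean of per-label F1, so rare labels count as much as frequent ones. The following is a minimal sketch of that computation for single-label classification, not the authors' evaluation code; the label names in the usage example are hypothetical placeholders, since the summary does not list the taxonomy's actual labels.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
gold = ["vague", "vague", "redundant", "useful"]
pred = ["vague", "redundant", "redundant", "useful"]
print(round(macro_f1(gold, pred), 3))  # 0.778
```

Because every label contributes equally, a model that handles the common labels well but fails on one rare label can still end up with a low macro-F1.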
Abstract
Context: Code reviews are essential for maintaining software quality, yet many human review comments suffer from issues such as redundancy, vagueness, or lack of constructiveness. These types of comments may slow down feedback and obscure important insights. Prior work on code review comments mostly explores the detection and categorization of useful comments, while fine-grained categorization of comment issues remains underexplored. Objective: This work aims to design and evaluate an automated system for classifying code review comments according to specific categories of issues. Methodology: We introduced a nine-label taxonomy for code review comments, covering six review comment smells and three common useful intents, and manually labeled 448 comments from a publicly available dataset. We benchmarked zero-shot and one-shot single-label classification over each comment and its…
