Studying Quality Improvements Recommended via Manual and Automated Code Review
Giuseppe Crupi, Rosalia Tufano, Gabriele Bavota

TL;DR
This study compares human and AI (ChatGPT-4) code reviews, revealing AI's limitations in identifying quality issues but also its potential as a supplementary tool to human reviewers.
Contribution
It provides a detailed comparison of human and AI code review recommendations, highlighting the strengths and limitations of current DL-based approaches.
Findings
ChatGPT recommends 2.4 times more code changes than humans.
ChatGPT detects only 10% of issues identified by humans.
Approximately 40% of AI suggestions point to meaningful quality issues.
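The three findings above are ratios over sets of review recommendations. As a rough sketch only, the snippet below shows one way such ratios could be computed once each recommendation has been labeled. The `Recommendation` class, the `compare_reviews` function, the matching rule (same PR and same category), every category label other than "rename variable/constant", and the toy data are all assumptions made for illustration. The study relied on manual inspection, and its actual matching procedure is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    pr_id: str     # hypothetical PR identifier
    category: str  # type of quality improvement, e.g. "rename variable/constant"


def compare_reviews(human, ai, meaningful_ai):
    """Compute three illustrative ratios from labeled recommendations.

    Assumption: two recommendations refer to the same issue when they share
    both the PR and the category. This is a simplification, not the paper's rule.
    """
    human_set, ai_set = set(human), set(ai)
    if not human_set or not ai_set:
        raise ValueError("both reviewers need at least one recommendation")
    return {
        # How many recommendations the AI makes per human recommendation.
        "volume_ratio": len(ai_set) / len(human_set),
        # Share of human-identified issues that the AI also raised.
        "human_issue_coverage": len(human_set & ai_set) / len(human_set),
        # Share of AI recommendations judged to be meaningful.
        "meaningful_share": len(set(meaningful_ai) & ai_set) / len(ai_set),
    }


# Toy data, invented for this example; it does not reproduce the paper's numbers.
human = [
    Recommendation("pr-1", "rename variable/constant"),
    Recommendation("pr-1", "add documentation"),
    Recommendation("pr-2", "remove dead code"),
    Recommendation("pr-2", "simplify condition"),
]
ai = [
    Recommendation("pr-1", "rename variable/constant"),  # also raised by a human
    Recommendation("pr-1", "add logging"),
    Recommendation("pr-1", "extract method"),
    Recommendation("pr-1", "add null check"),
    Recommendation("pr-2", "add test"),
    Recommendation("pr-2", "improve naming"),
    Recommendation("pr-2", "add documentation"),
    Recommendation("pr-2", "reorder imports"),
]
meaningful_ai = ai[:3]  # pretend a manual check accepted the first three

result = compare_reviews(human, ai, meaningful_ai)
print(result)
```

Matching on exact category labels is deliberately strict. A real analysis would likely need a human judgment of whether two differently worded comments target the same issue.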
Abstract
Several Deep Learning (DL)-based techniques have been proposed to automate code review. Still, it is unclear to what extent these approaches can recommend quality improvements as a human reviewer would. We study the similarities and differences between code reviews performed by humans and those automatically generated by DL models, using ChatGPT-4 as a representative of the latter. In particular, we run a mining-based study in which we collect and manually inspect 739 comments posted by human reviewers to suggest code changes in 240 PRs. The manual inspection aims to classify the type of quality improvement recommended by human reviewers (e.g., rename variable/constant). Then, we ask ChatGPT to perform a code review on the same PRs and compare the quality improvements it recommends against those suggested by the human reviewers. We show that while, on average, ChatGPT tends to…
Taxonomy
Topics: Software Engineering Research · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI
