Harnessing Large Language Models for Curated Code Reviews
Oussama Ben Sghaier, Martin Weyssow, Houari Sahraoui

TL;DR
This paper presents a data curation pipeline using large language models to improve the quality of code review datasets, resulting in better automated comment generation and code refinement.
Contribution
The authors develop a novel dataset curation pipeline that significantly enhances data quality for AI-based code review tasks using LLMs.
Findings
The curated dataset shows improved comment clarity and conciseness.
The curated dataset leads to better performance in comment generation.
Improved comments facilitate more accurate code refinement.
Abstract
In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes that ensure an efficient code review process. Well-crafted comments not only streamline the code review itself but are also essential for subsequent tasks like code refinement, where the code is modified to satisfy the input review comment. Although various AI-based approaches have aimed to automate comment generation, their effectiveness remains limited by the quality of the training data. Existing code review datasets are often noisy and unrefined, which limits the learning potential of AI models and hinders the automation process. To address these challenges, we propose a curation pipeline designed to enhance the quality of the largest publicly available code review dataset. We begin by establishing an evaluation framework, incorporating…
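As a rough illustration of the idea in the abstract (score each review comment against quality criteria, then act on the scores), a curation pass might be sketched as below. This is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract is cut off before it describes the evaluation framework, so the two criteria (clarity and conciseness, borrowed from the Findings), the `ReviewSample` type, the `stub_judge` heuristics, and the threshold are all hypothetical. In practice the judge would be an LLM call; a word-count heuristic stands in for it here so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ReviewSample:
    diff: str      # code change under review
    comment: str   # reviewer comment attached to the change


# A judge maps a sample to per-criterion scores in [0, 1].
Judge = Callable[[ReviewSample], Dict[str, float]]


def stub_judge(sample: ReviewSample) -> Dict[str, float]:
    """Placeholder for an LLM call; crude heuristics, for illustration only."""
    words = sample.comment.split()
    # Hypothetical proxy: very short comments are treated as unclear.
    clarity = 1.0 if len(words) >= 4 else 0.0
    # Hypothetical proxy: very long comments are treated as not concise.
    conciseness = 1.0 if len(words) <= 40 else 0.0
    return {"clarity": clarity, "conciseness": conciseness}


def curate(
    samples: List[ReviewSample], judge: Judge, threshold: float = 0.5
) -> List[ReviewSample]:
    """Keep only samples whose every criterion score meets the threshold."""
    kept = []
    for sample in samples:
        scores = judge(sample)
        if all(score >= threshold for score in scores.values()):
            kept.append(sample)
    return kept


samples = [
    ReviewSample("- x = 1\n+ x = 2", "nit"),
    ReviewSample(
        "- open(p)\n+ with open(p) as f:",
        "Use a context manager so the file is closed on error.",
    ),
]
curated = curate(samples, stub_judge)
```

Filtering is only one possible action on the scores. The excerpt above does not say whether low-quality comments are dropped, rewritten, or both, so the `curate` function should be read as a placeholder for whichever step the full paper describes.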
Taxonomy
Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Topic Modeling
