Harnessing Large Language Models for Curated Code Reviews

Oussama Ben Sghaier; Martin Weyssow; Houari Sahraoui

arXiv:2502.03425·cs.SE·February 6, 2025

TL;DR

This paper presents a data curation pipeline using large language models to improve the quality of code review datasets, resulting in better automated comment generation and code refinement.

Contribution

The authors develop a novel dataset curation pipeline that significantly enhances data quality for AI-based code review tasks using LLMs.

Findings

01. Curated dataset shows improved comment clarity and conciseness.

02. Enhanced dataset leads to better performance in comment generation.

03. Improved comments facilitate more accurate code refinement.

Abstract

In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes that ensure an efficient code review process. Well-crafted comments not only streamline the code review itself but are also essential for subsequent tasks like code refinement, where the code is modified to satisfy the input review comment. Although various AI-based approaches aimed to automate comment generation, their effectiveness remains limited by the quality of the training data. Existing code review datasets are often noisy and unrefined, posing limitations to the learning potential of AI models and hindering the automation process. To address these challenges, we propose a curation pipeline designed to enhance the quality of the largest publicly available code review dataset. We begin by establishing an evaluation framework, incorporating…

Code & Models

Repositories

OussamaSghaier/CuREV (Official)

Taxonomy

Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Topic Modeling