Distilling Desired Comments for Enhanced Code Review with Large Language Models
Yongda Yu, Lei Zhang, Guoping Rong, Haifeng Shen, Jiahao Zhang, Haoxiang Yan, Guohao Shi, Dong Shao, Ruiqi Pan, Yuan Li, Qiushi Wang, Zhao Tian

TL;DR
This paper introduces Desiview, a dataset distillation method that automatically extracts desired review comments from code review data, significantly improving LLMs' ability to generate accurate and relevant code review comments.
Contribution
The paper proposes Desiview, a novel automatic dataset distillation approach for enhancing LLMs in code review tasks, and demonstrates its effectiveness with state-of-the-art performance.
Findings
Desiview achieves over 88% precision and 86% accuracy in identifying desired review comments.
Fine-tuning LLaMA models with the distilled dataset improves their code review comment generation.
Enhanced models outperform base LLMs in accuracy and relevance of review comments.
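For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how precision and accuracy are computed for a binary "desired / not desired" classifier. The function name and the toy labels are illustrative assumptions, not data or code from the paper.

```python
from typing import Sequence, Tuple


def precision_and_accuracy(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (precision, accuracy) for binary predictions.

    True means "this review comment is a desired review comment (DRC)".
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # correctly flagged as DRC
    fp = sum(1 for p, a in pairs if p and not a)      # wrongly flagged as DRC
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly rejected

    # Precision: of the comments flagged as DRCs, the share that really are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Accuracy: the share of all comments that were classified correctly.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, accuracy


# Toy labels, made up for illustration only.
toy_predicted = [True, True, False, False, True]
toy_actual = [True, False, False, True, True]
toy_precision, toy_accuracy = precision_and_accuracy(toy_predicted, toy_actual)
```

Precision matters most for distillation: a false positive puts an undesired comment into the fine-tuning data, whereas a false negative only shrinks the dataset.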
Abstract
There has been a growing interest in using Large Language Models (LLMs) for code review thanks to their proven proficiency in code comprehension. The primary objective of most review scenarios is to generate desired review comments (DRCs) that explicitly identify issues to trigger code fixes. However, existing LLM-based solutions are not so effective in generating DRCs for various reasons such as hallucination. To enhance their code review ability, they need to be fine-tuned with a customized dataset that is ideally full of DRCs. Nevertheless, such a dataset is not yet available, while manual annotation of DRCs is too laborious to be practical. In this paper, we propose a dataset distillation method, Desiview, which can automatically construct a distilled dataset by identifying DRCs from a code review dataset. Experiments on the CodeReviewer dataset comprising more than 150K review…
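The distillation idea described in the abstract amounts to filtering a review dataset down to the samples judged to be DRCs. The sketch below shows only that filtering shape. `ReviewSample`, `distill`, and `toy_is_desired` are hypothetical names, and the keyword heuristic is a placeholder standing in for whatever classifier Desiview actually uses; it is not the paper's method.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ReviewSample:
    """One code review record (hypothetical schema)."""
    diff: str     # the code change under review
    comment: str  # the reviewer's comment on that change


def distill(
    samples: Iterable[ReviewSample],
    is_desired: Callable[[ReviewSample], bool],
) -> List[ReviewSample]:
    """Keep only the samples that the given predicate labels as DRCs."""
    return [sample for sample in samples if is_desired(sample)]


def toy_is_desired(sample: ReviewSample) -> bool:
    """Placeholder heuristic (an assumption, NOT Desiview's classifier).

    Flags comments that appear to point out an issue needing a fix.
    """
    keywords = ("bug", "should", "missing", "fix")
    text = sample.comment.lower()
    return any(keyword in text for keyword in keywords)


# Made-up records for illustration only.
raw_samples = [
    ReviewSample("+ return items[0]", "This should handle an empty list."),
    ReviewSample("+ x = compute()", "LGTM"),
]
distilled = distill(raw_samples, toy_is_desired)
```

Passing the predicate in as a parameter keeps the filtering step independent of how DRCs are identified, so a learned classifier could replace the toy heuristic without changing `distill`.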
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques
Methods: Balanced Selection · LLaMA
