Retrieval-Augmented Code Review Comment Generation
Hyunsun Hong, Jongmoon Baik

TL;DR
This paper introduces a retrieval-augmented generation approach for automated code review comment generation, combining the strengths of generation-based and IR-based methods to improve accuracy and token recovery.
Contribution
It proposes a retrieval-augmented generation method that conditions pretrained language models on retrieved code review examples, enhancing comment generation quality.
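The general idea of conditioning a generator on retrieved examples can be sketched as below. This is only a minimal illustration, not the authors' pipeline: the Jaccard token-overlap retriever, the function names (`tokenize`, `retrieve`, `build_input`), and the ` [SEP] ` concatenation format are all assumptions made for the example, and the pretrained language model that would consume the augmented input is left out.

```python
import re
from typing import List, Set, Tuple

# A retrieval database entry: (code change, review comment written for it).
Example = Tuple[str, str]


def tokenize(code: str) -> Set[str]:
    """Split code into a set of identifier-like tokens (assumed, simplistic)."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query_code: str, database: List[Example], k: int) -> List[Example]:
    """Return the k database entries whose code is most similar to the query."""
    query_tokens = tokenize(query_code)
    ranked = sorted(
        database,
        key=lambda entry: jaccard(query_tokens, tokenize(entry[0])),
        reverse=True,
    )
    return ranked[:k]


def build_input(query_code: str, exemplars: List[Example], sep: str = " [SEP] ") -> str:
    """Append the retrieved review comments to the query code.

    The separator and ordering are hypothetical; a real system would use
    whatever input format its pretrained model expects.
    """
    return sep.join([query_code] + [comment for _, comment in exemplars])


if __name__ == "__main__":
    database = [
        ("def add(a, b): return a + b", "Add type hints."),
        ("for i in range(len(items)): print(items[i])", "Iterate over items directly."),
        ("open(path).read()", "Use a context manager to close the file."),
    ]
    query = "data = open(config_path).read()"
    top = retrieve(query, database, k=1)
    print(build_input(query, top))
```

The augmented string would then be passed to a generation model in place of the bare code change, so that rare tokens present in the retrieved comments are available for the model to reuse. Raising `k` adds more exemplars to the input, at the cost of a longer sequence.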
Findings
Outperforms existing generation-based and IR-based methods in accuracy.
Improves low-frequency token generation by up to 24%.
Performance increases with more retrieved exemplars.
Abstract
Automated code review comment generation (RCG) aims to assist developers by automatically producing natural language feedback for code changes. Existing approaches are primarily either generation-based, using pretrained language models, or information retrieval (IR)-based, reusing comments from similar past examples. While generation-based methods leverage code-specific pretraining on large code-natural language corpora to learn semantic relationships between code and natural language, they often struggle to generate low-frequency but semantically important tokens due to their probabilistic nature. In contrast, IR-based methods excel at recovering such rare tokens by copying from existing examples but lack flexibility in adapting to new code contexts, for example when input code contains identifiers or structures not found in the retrieval database. To bridge the gap between…
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling
