A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era
Taufiqul Islam Khan, Shaowei Wang, Haoxiang Zhang, and Tse-Hsun Chen

TL;DR
This survey comprehensively analyzes 99 code review benchmarks from 2015 to 2025, highlighting trends, limitations, and future directions to improve evaluation practices in both pre-LLM and LLM eras.
Contribution
It provides a systematic taxonomy of code review research, analyzes existing benchmarks, and outlines future directions for more effective evaluation of LLM-based code review tools.
Findings
Shift towards end-to-end generative peer review
Increase in multilingual code review coverage
Decline in standalone change understanding tasks
Abstract
Code review is a critical practice in modern software engineering, helping developers detect defects early, improve code quality, and facilitate knowledge sharing. With the rapid advancement of large language models (LLMs), a growing body of work has explored automated support for code review. However, progress in this area is hindered by the lack of a systematic understanding of existing benchmarks and evaluation practices. Current code review datasets are scattered, vary widely in design, and provide limited insight into what review capabilities are actually being assessed. In this paper, we present a comprehensive survey of code review benchmarks spanning both the Pre-LLM and LLM eras (2015 to 2025). We analyze 99 research papers (58 Pre-LLM era and 41 LLM era) and extract key metadata, including datasets, evaluation metrics, data sources, and target tasks. Based on this analysis, we…
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Software Engineering Techniques and Practices
