Evaluating Large Language Models for Code Review

Umut Cihan; Arda İçöz; Vahid Haratian; Eray Tüzün

arXiv:2505.20206 · cs.SE · May 27, 2025

TL;DR

This paper systematically evaluates how accurately large language models such as GPT-4o and Gemini 2.0 Flash perform code review, highlighting their potential and limitations in judging code correctness and suggesting improvements.

Contribution

It provides a comparative analysis of LLMs' performance in code review tasks and introduces a human-in-the-loop process to mitigate risks of faulty outputs.

Findings

01

GPT-4o correctly classified code correctness 68.50% of the time when given problem descriptions.

02

Gemini 2.0 Flash achieved 63.89% accuracy in correctness classification when given problem descriptions.

03

Performance declines without problem descriptions and varies with code type.

Abstract

Context: Code reviews are crucial for software quality. Recent AI advances have allowed large language models (LLMs) to review and fix code; now, there are tools that perform these reviews. However, their reliability and accuracy have not yet been systematically evaluated. Objective: This study compares different LLMs' performance in detecting code correctness and suggesting improvements. Method: We tested GPT-4o and Gemini 2.0 Flash on 492 AI-generated code blocks of varying correctness, along with 164 canonical code blocks from the HumanEval benchmark. To simulate the code review task objectively, we expected LLMs to assess code correctness and improve the code if needed. We ran experiments with different configurations and reported the results. Results: With problem descriptions, GPT-4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time,…
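
The correctness-classification percentages in the abstract read as plain accuracy: the share of code blocks for which the reviewer's verdict matches the block's true correctness. The sketch below illustrates that calculation only. The `Review` type, the `classification_accuracy` helper, and the toy data are hypothetical, and the assumption that the ground-truth label comes from running a test suite is ours, not a detail confirmed by the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """One reviewed code block (hypothetical structure, for illustration)."""
    is_correct: bool      # ground truth, assumed to come from a test suite
    judged_correct: bool  # the reviewer's (LLM's) verdict


def classification_accuracy(reviews):
    """Fraction of reviews whose verdict matches the ground truth."""
    if not reviews:
        raise ValueError("no reviews to score")
    hits = sum(r.is_correct == r.judged_correct for r in reviews)
    return hits / len(reviews)


# Toy data, NOT results from the paper: three of five verdicts match.
toy_reviews = [
    Review(is_correct=True, judged_correct=True),
    Review(is_correct=True, judged_correct=False),
    Review(is_correct=False, judged_correct=False),
    Review(is_correct=False, judged_correct=True),
    Review(is_correct=True, judged_correct=True),
]

accuracy = classification_accuracy(toy_reviews)
print(f"{accuracy:.2%}")  # prints 60.00%
```

Accuracy treats a missed bug and a false alarm as equally wrong; a study that cares about the difference would also report the two error types separately.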

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Software Reliability and Analysis Research