Evaluating Source Code Quality with Large Language Models: a comparative study
Igor Regis da Silva Simões, Elaine Venson

TL;DR
This study explores the potential of large language models to evaluate source code quality, comparing their assessments with traditional static analysis tools like SonarQube across open source Java projects.
Contribution
It provides an empirical comparison of GPT-3.5 Turbo and GPT-4o in assessing code quality, highlighting their capabilities and limitations.
Findings
GPT-3.5 Turbo correlates with SonarQube metrics
GPT-4o diverges from traditional assessments
LLMs show potential but have limitations in code quality evaluation
Abstract
Code quality is an attribute composed of various metrics, such as complexity, readability, testability, interoperability, reusability, and the use of good or bad practices, among others. Static code analysis tools aim to measure a set of attributes to assess code quality. However, some quality attributes can only be measured by humans in code review activities, readability being an example. Given their natural language text processing capability, we hypothesize that a Large Language Model (LLM) could evaluate the quality of code, including attributes currently not automatable. This paper aims to describe and analyze the results obtained using LLMs as a static analysis tool, evaluating the overall quality of code. We compared the LLM with the results obtained with the SonarQube software and its Maintainability metric for two Open Source Software (OSS) Java projects, one with…
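As an illustration of the kind of comparison the abstract describes, the sketch below computes a Spearman rank correlation between per-file quality scores from an LLM and a SonarQube maintainability measure. This is a minimal sketch under stated assumptions, not the authors' method: the excerpt does not say which statistic the study used, and the function names, the 0-100 score scale, and all numbers are hypothetical placeholders rather than data from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical values for five files (NOT results from the study).
# llm_scores: assumed 0-100 quality score returned by the LLM (higher is better).
# sonar_debt_minutes: assumed maintainability remediation effort (lower is better).
llm_scores = [80, 65, 90, 40, 75]
sonar_debt_minutes = [10, 30, 5, 120, 20]

rho = spearman(llm_scores, sonar_debt_minutes)
print(f"Spearman rho = {rho:.2f}")
```

Because a higher LLM score and a lower remediation effort both indicate better code in this setup, agreement between the two tools would show up as a strongly negative coefficient. A rank-based statistic is used here only because the two scales are not directly comparable.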
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
