An evaluation of LLM code generation capabilities through graded exercises

Álvaro Barbero Jiménez

arXiv:2410.16292·cs.SE·October 23, 2024

TL;DR

This paper reviews current evaluation methods for LLM code generation, presents a new evaluation of GPT4-o-mini on Codewars coding challenges in 8 programming languages, and finds that performance is associated with task difficulty, language popularity, and potential training data leakage.

Contribution

It provides a comprehensive review of evaluation techniques and presents a new empirical assessment revealing factors affecting LLM code generation performance.

Findings

01

Model success correlates with task difficulty, language popularity, and challenge age.

02

Approximately 37.4% of performance may be due to training data leakage.

03

Current evaluation methods may overestimate LLM coding skills.

Abstract

Large Language Models have shown prominent capabilities in generating functional code from natural language descriptions. However, a standardized way to evaluate these capabilities in an objective and unbiased manner is still to be found. In this paper we review the current evaluation methods available to this end, and run a new evaluation of the performance of one state-of-the-art model (GPT4-o-mini) in solving curated coding challenges in 8 programming languages, obtained from Codewars, a software development community. Our analysis shows that the chance of success of the model has a positive correlation with the task difficulty, the popularity of the programming language being used and the time elapsed since the publication of the challenge. A further approximate explanatory analysis in terms of high-level features hints that while 46.6% of the model performance could be attributed…
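The correlation between model success and challenge features mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's analysis code: the `pearson` helper, the `toy_results` records, and the choice of challenge age as the feature are all assumptions made for illustration, and the toy numbers are invented rather than taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy records, NOT data from the paper:
# (challenge age in days, 1 if the generated solution passed its tests else 0)
toy_results = [(30, 0), (90, 0), (400, 1), (800, 1), (1500, 1), (200, 0)]

ages = [age for age, _ in toy_results]
passed = [ok for _, ok in toy_results]

# With a binary outcome, Pearson's r is the point-biserial correlation.
r = pearson(ages, passed)
print(f"correlation between challenge age and success: {r:.3f}")
```

A positive `r` on real data would be consistent with older challenges being solved more often, which is the pattern the abstract links to possible leakage of solutions into the training set. A correlation alone does not establish that cause.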

Taxonomy

Topics: Digital Rights Management and Security