Holistic Evaluation of State-of-the-Art LLMs for Code Generation

Le Zhang; Suresh Kothari

arXiv:2512.18131·cs.SE·December 23, 2025

Open Access

TL;DR

This paper empirically evaluates six leading LLMs for code generation across multiple programming languages, revealing performance disparities and emphasizing the importance of prompt engineering and human oversight for reliable results.

Contribution

It provides a comprehensive benchmark of state-of-the-art LLMs for code generation, highlighting their strengths, weaknesses, and practical considerations for deployment.

Findings

1. DeepSeek-R1 and GPT-4.1 outperform others in correctness and efficiency

2. Common failure modes include syntax errors and logical flaws

3. Prompt engineering significantly improves code quality

Abstract

This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode problems across five programming languages, we assess model performance using rigorous metrics: compile-time errors, runtime errors, functional failures, and algorithmic suboptimalities. The results reveal significant performance variations, with DeepSeek-R1 and GPT-4.1 consistently outperforming the other models in correctness, efficiency, and robustness. Through detailed case studies, we identify common failure scenarios such as syntax errors, logical flaws, and suboptimal algorithms, highlighting the critical role of prompt engineering and human oversight in improving results. Based on these findings, we provide actionable recommendations for developers and…
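The four metrics named in the abstract can be read as mutually exclusive outcome categories for each generated solution. The following is a minimal sketch of such a tally, not the authors' evaluation harness: the `Outcome` enum, the `classify` and `summarize` functions, the input fields (`compiled`, `crashed`, `tests_passed`, `tests_total`, `within_time_limit`), and the rule that a time-limit miss counts as an algorithmic suboptimality are all assumptions made for illustration.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    # Category names follow the abstract; ACCEPTED is an added catch-all.
    COMPILE_ERROR = "compile-time error"
    RUNTIME_ERROR = "runtime error"
    FUNCTIONAL_FAILURE = "functional failure"
    SUBOPTIMAL = "algorithmic suboptimality"
    ACCEPTED = "accepted"


def classify(compiled, crashed, tests_passed, tests_total, within_time_limit):
    """Map one submission to a single outcome, checking the earliest failure first."""
    if not compiled:
        return Outcome.COMPILE_ERROR
    if crashed:
        return Outcome.RUNTIME_ERROR
    if tests_passed < tests_total:
        return Outcome.FUNCTIONAL_FAILURE
    # Assumption: a correct solution that misses the time limit is "suboptimal".
    if not within_time_limit:
        return Outcome.SUBOPTIMAL
    return Outcome.ACCEPTED


def summarize(results):
    """Return the fraction of submissions in each outcome category."""
    counts = Counter(classify(**r) for r in results)
    total = len(results)
    if total == 0:
        return {o.value: 0.0 for o in Outcome}
    return {o.value: counts[o] / total for o in Outcome}


if __name__ == "__main__":
    # Hypothetical records for one model; these are not data from the paper.
    demo = [
        dict(compiled=True, crashed=False, tests_passed=10, tests_total=10, within_time_limit=True),
        dict(compiled=False, crashed=False, tests_passed=0, tests_total=10, within_time_limit=False),
    ]
    for name, share in summarize(demo).items():
        print(f"{name}: {share:.2f}")
```

Checking failures in pipeline order (compile, run, test, performance) keeps each submission in exactly one category, so the per-category fractions sum to one and can be compared across models and languages.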

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Model-Driven Software Engineering Techniques