Holistic Evaluation of State-of-the-Art LLMs for Code Generation
Le Zhang, Suresh Kothari

TL;DR
This paper empirically evaluates six leading LLMs for code generation on LeetCode problems across five programming languages, revealing performance disparities and emphasizing the importance of prompt engineering and human oversight for reliable results.
Contribution
It provides a comprehensive benchmark of state-of-the-art LLMs for code generation, highlighting their strengths, weaknesses, and practical considerations for deployment.
Findings
DeepSeek-R1 and GPT-4.1 outperform others in correctness and efficiency
Common failure modes include syntax errors and logical flaws
Prompt engineering significantly improves code quality
Abstract
This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode problems across five programming languages, we assess model performance using rigorous metrics: compile-time errors, runtime errors, functional failures, and algorithmic suboptimalities. The results reveal significant performance variations, with DeepSeek-R1 and GPT-4.1 consistently outperforming the others in terms of correctness, efficiency, and robustness. Through detailed case studies, we identify common failure scenarios such as syntax errors, logical flaws, and suboptimal algorithms, highlighting the critical role of prompt engineering and human oversight in improving results. Based on these findings, we provide actionable recommendations for developers and…
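The four metrics named in the abstract can be read as mutually exclusive outcome categories for each generated solution. The sketch below is illustrative only and is not the authors' evaluation harness: the `RunResult` fields, the order of the checks, and the use of a time limit as a stand-in for "algorithmic suboptimality" are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMPILE_ERROR = "compile-time error"
    RUNTIME_ERROR = "runtime error"
    FUNCTIONAL_FAILURE = "functional failure"
    SUBOPTIMAL = "algorithmic suboptimality"
    ACCEPTED = "accepted"


@dataclass
class RunResult:
    """Hypothetical record of one judged submission (assumed fields)."""
    compiled: bool
    crashed: bool
    tests_passed: bool
    within_time_limit: bool


def classify(result: RunResult) -> Outcome:
    # Assumed precedence: earlier failure stages mask later ones.
    if not result.compiled:
        return Outcome.COMPILE_ERROR
    if result.crashed:
        return Outcome.RUNTIME_ERROR
    if not result.tests_passed:
        return Outcome.FUNCTIONAL_FAILURE
    # Assumption: a correct but too-slow solution counts as suboptimal.
    if not result.within_time_limit:
        return Outcome.SUBOPTIMAL
    return Outcome.ACCEPTED


def tally(results):
    """Count how many submissions fall into each outcome category."""
    return Counter(classify(r) for r in results)
```

Calling `tally` on the results for one model and one language would give per-category counts that can then be compared across models.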
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Model-Driven Software Engineering Techniques
