What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Shihan Dou; Haoxiang Jia; Shenxi Wu; Huiyuan Zheng; Muling Wu; Yunbo Tao; Ming Zhang; Mingxu Chai; Jessica Fan; Zhiheng Xi; Rui Zheng; Yueming Wu; Ming Wen; Tao Gui; Qi Zhang; Xipeng Qiu; Xuanjing Huang

arXiv:2407.06153 · cs.SE · October 20, 2025 · 6 cites

TL;DR

This empirical study evaluates how well large language models generate code. It shows that they struggle with complex problems, categorizes the bugs they produce, and proposes a self-critique correction method.

Contribution

The paper provides a comprehensive analysis of LLMs' code generation capabilities, introduces a bug taxonomy, and proposes a novel self-critique method to improve code quality without additional training.

Findings

01

LLMs struggle with complex problems, producing code that is shorter yet more complicated than the canonical solutions.

02

A bug taxonomy with 3 categories and 10 sub-categories was developed and analyzed.

03

A self-critique iterative method improves code correctness without retraining.
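
The third finding describes an iterative loop in which the model critiques and repairs its own output with no retraining. The sketch below shows one generic shape such a loop could take. It is not the authors' implementation: the callables `generate`, `critique`, and `run_tests`, the prompt layout, and the `max_rounds` cap are all hypothetical names and choices made for illustration.

```python
from typing import Callable, Optional, Tuple


def self_critique_loop(
    task: str,
    generate: Callable[[str], str],            # hypothetical: prompt -> code
    critique: Callable[[str, str, str], str],  # hypothetical: (task, code, error) -> feedback
    run_tests: Callable[[str], Optional[str]], # hypothetical: code -> error text, or None on success
    max_rounds: int = 3,                       # assumed cap on repair attempts
) -> Tuple[str, bool]:
    """Generate code, then repeatedly critique and regenerate until tests pass."""
    code = generate(task)
    for _ in range(max_rounds):
        error = run_tests(code)
        if error is None:
            return code, True
        # Ask the model to explain what went wrong, then retry with that feedback.
        feedback = critique(task, code, error)
        prompt = f"{task}\n\nPrevious attempt:\n{code}\n\nCritique:\n{feedback}"
        code = generate(prompt)
    # Check the last regenerated attempt before giving up.
    return code, run_tests(code) is None
```

Passing the model calls in as plain functions keeps the loop independent of any particular LLM client, so it can be exercised with stub functions before a real model is wired in.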

Abstract

The increasing development of LLMs in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions.…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Softmax · Attention Is All You Need