Uncovering Weaknesses in Neural Code Generation
Xiaoli Lian, Shuaisong Wang, Jieping Ma, Fang Liu, Xin Tan, Li Zhang, Lin Shi, Cuiyun Gao

TL;DR
This paper systematically evaluates state-of-the-art neural code generation models, identifying key weaknesses such as prompt inaccuracies, missing semantics, and API usage issues, to guide future research improvements.
Contribution
It provides the first comprehensive taxonomy of weaknesses in neural code generation, analyzing multiple models across diverse datasets with detailed thematic insights.
Findings
Large models fail in 26.84% of cases due to inaccurate prompts
Missing key semantics occurs in over 65% of tasks across datasets
All models struggle with proper API usage, especially with vague prompts
Abstract
Code generation, the task of producing source code from prompts, has seen significant advancements with the advent of pre-trained large language models (PLMs). Despite these achievements, the field lacks a comprehensive taxonomy of weaknesses in the benchmarks and the generated code, which risks focusing the community on known issues at the expense of under-explored areas. Our systematic study aims to fill this gap by evaluating five state-of-the-art PLMs: three larger models (CodeGen2.5 with 7 billion parameters, CodeGeeX2 with 6 billion parameters, and GPT-4 Turbo) and two smaller ones (UnixCoder with 110 million parameters and CodeT5 base with 220 million parameters), across three popular datasets: CoNaLa, HumanEval Plus, and DS-1000. We assess the quality of generated code using match-based and execution-based metrics, then conduct thematic analysis to develop a taxonomy of nine types of…
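To illustrate the distinction between match-based and execution-based metrics mentioned in the abstract, here is a minimal sketch. It is not the paper's evaluation harness: the function names `exact_match` and `passes_tests`, the whitespace normalization, and the toy `add` task are all assumptions made for illustration.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Match-based check (illustrative): whitespace-normalized string equality."""
    return " ".join(prediction.split()) == " ".join(reference.split())


def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execution-based check (illustrative): run the candidate, then its assertions.

    exec() is used only to keep the sketch short; untrusted generated code
    should be run in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)       # run assertions against them
    except Exception:
        return False
    return True


# Hypothetical toy task: the candidate differs textually but behaves the same.
reference = "def add(a, b):\n    return a + b"
candidate = "def add(x, y):\n    return x + y"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

print(exact_match(candidate, reference))  # match-based verdict
print(passes_tests(candidate, tests))     # execution-based verdict
```

In this toy case the candidate fails the match-based check because its parameter names differ, yet passes the execution-based check, which is one reason the two metric families can disagree.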
Taxonomy
Topics: Neural Networks and Applications
