Testing LLMs on Code Generation with Varying Levels of Prompt Specificity
Lincoln Murr, Morgan Grainger, David Gao

TL;DR
This study evaluates how prompt specificity affects the performance of various large language models in generating Python code, revealing optimal prompting strategies for accuracy and efficiency in automated code generation.
Contribution
It provides a comprehensive analysis of prompt specificity's impact on LLM code generation, offering practical guidelines for effective prompting strategies.
Findings
Performance varies significantly across LLMs and prompt types.
Optimal prompts improve code accuracy and efficiency.
The results yield guidelines for effective prompt design in code generation.
Abstract
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing. Among the many applications that benefit from LLMs, automated code generation is increasingly promising. The potential to transform natural language prompts into executable code promises a major shift in software development practices and paves the way for significant reductions in manual coding efforts and the likelihood of human-induced errors. This paper reports the results of a study that evaluates the performance of various LLMs, such as Bard, ChatGPT-3.5, ChatGPT-4, and Claude-2, in generating Python code for coding problems. We focus on how levels of prompt specificity impact the accuracy, time efficiency, and space efficiency of the generated code. A benchmark of 104 coding problems, each with four types of prompts with varying degrees of tests and…
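The three metrics named in the abstract (accuracy, time efficiency, space efficiency) can be illustrated with a minimal sketch. This is not the authors' harness, which is not described in the text available here. The `evaluate` helper, the `two_sum` example solution, and the test cases are all hypothetical; accuracy is assumed to mean the fraction of test cases passed, and space is approximated by peak Python heap allocation as reported by `tracemalloc`.

```python
import time
import tracemalloc


def evaluate(func, test_cases):
    """Run func on (args, expected) pairs and collect three rough metrics.

    Hypothetical helper: accuracy is the fraction of tests passed, time is
    wall-clock seconds for the whole suite, and space is the peak traced
    allocation in bytes while the suite ran.
    """
    passed = 0
    tracemalloc.start()
    start = time.perf_counter()
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing solution simply fails that test case
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "accuracy": passed / len(test_cases),
        "seconds": elapsed,
        "peak_bytes": peak,
    }


# Stand-in for a model-generated solution (illustrative only).
def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []


# Made-up test cases: (positional arguments, expected return value).
tests = [
    (([2, 7, 11, 15], 9), [0, 1]),
    (([3, 2, 4], 6), [1, 2]),
    (([1, 2], 10), []),
]

result = evaluate(two_sum, tests)
print(result)
```

Running the same test suite against the code produced under each prompt type would give one comparable row of metrics per model and prompt, which is the kind of comparison the abstract describes.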
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
