Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study

Bowen Li; Wenhan Wu; Ziwei Tang; Lin Shi; John Yang; Jinyang Li; Shunyu Yao; Chen Qian; Binyuan Hui; Qicheng Zhang; Zhiyin Yu; He Du; Ping Yang; Dahua Lin; Chao Peng; Kai Chen

arXiv:2403.08604·cs.CL·December 17, 2024·3 cites

Open Access 2 Repos

TL;DR

This paper evaluates large language models across the entire software development lifecycle using DevEval, revealing current models' limitations in handling real-world programming tasks and providing insights for future improvements.

Contribution

It introduces DevEval, a comprehensive benchmark for assessing LLMs on full software development stages across multiple languages and domains, highlighting current models' shortcomings.

Findings

01. Current LLMs, including GPT-4, struggle with real-world programming challenges.

02. DevEval covers the software design, environment setup, implementation, acceptance testing, and unit testing stages.

03. Empirical results show significant performance gaps in existing models.

Abstract

Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks have predominantly focused on simplified or isolated aspects of coding, such as single-file code generation or repository issue debugging, falling short of measuring the full spectrum of challenges raised by real-world programming activities. In this case study, we explore the performance of LLMs across the entire software development lifecycle with DevEval, encompassing stages including software design, environment setup, implementation, acceptance testing, and unit testing. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval. Our findings offer…

Taxonomy

Topics: Software System Performance and Reliability · Software Engineering Techniques and Practices

Methods: Linear Layer · Dropout · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Label Smoothing