Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework

Jianru Shen; Zedong Peng; Lucy Owen

arXiv:2602.02896·cs.SE·February 4, 2026

Open Access

TL;DR

This empirical study evaluates enhancement strategies for LLM-based code generation (self-critique, multi-model collaboration, and retrieval-augmented generation), shows that their effectiveness varies with failure type, and proposes a decision framework to help practitioners choose among them.

Contribution

The paper introduces a data-driven decision framework that guides the selection of enhancement methods for LLM code generation based on failure characteristics.

Findings

01. Progressive prompting achieves 96.9% average task completion, outperforming direct prompting (80.5%).

02. The effectiveness of each enhancement method depends on the failure type.

03. RAG provides the highest overall completion and efficiency.

Abstract

Large language models (LLMs) show promise for automating software development by translating requirements into code. However, even advanced prompting workflows like progressive prompting often leave some requirements unmet. Although methods such as self-critique, multi-model collaboration, and retrieval-augmented generation (RAG) have been proposed to address these gaps, developers lack clear guidance on when to use each. In an empirical study of 25 GitHub projects, we found that progressive prompting achieves 96.9% average task completion, significantly outperforming direct prompting (80.5%, Cohen's d=1.63, p<0.001) but still leaving 8 projects incomplete. For 6 of the most representative projects, we evaluated each enhancement strategy across 4 failure types. Our results reveal that method effectiveness depends critically on failure characteristics: Self-Critique succeeds on…
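The effect size quoted in the abstract (Cohen's d) can be illustrated with a minimal sketch. It assumes the common pooled-standard-deviation form of Cohen's d for two groups; the abstract does not say which variant the authors used. The function name `cohens_d` and the completion rates below are hypothetical and are not the paper's data.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical per-project completion rates in percent (illustration only).
progressive = [100.0, 95.0, 100.0, 90.0, 100.0]
direct = [80.0, 85.0, 75.0, 90.0, 70.0]

d = cohens_d(progressive, direct)
print(f"Cohen's d = {d:.2f}")
```

By the usual convention, values of d above roughly 0.8 are read as a large effect, so the reported d=1.63 would indicate a substantial gap between progressive and direct prompting.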

Taxonomy

Topics: Software System Performance and Reliability · Software Engineering Research · Scientific Computing and Data Management