Planning-Driven Programming: A Large Language Model Programming Workflow
Chao Lei, Yanchuan Chang, Nir Lipovetzky, Krista A. Ehinger

TL;DR
This paper introduces a structured two-phase LLM programming workflow (LPW) that enhances code generation accuracy by combining solution planning, verification, and iterative refinement, significantly outperforming existing methods.
Contribution
The paper proposes a novel LLM programming workflow that improves initial code generation and refinement through structured planning and verification, achieving state-of-the-art accuracy on multiple benchmarks.
Findings
LPW improves Pass@1 accuracy by up to 16.4% over state-of-the-art methods.
LPW achieves new state-of-the-art Pass@1 accuracy on multiple benchmarks.
The workflow significantly enhances code correctness and refinement efficiency.
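The findings above are stated in terms of Pass@1. This summary does not say how the authors compute it, so the following is only a sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass all hidden tests. The function name `pass_at_k` is illustrative and is not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number of correct samples,
    k: sample budget being scored. Assumes the common estimator
    1 - C(n - c, k) / C(n, k); the paper's exact protocol may differ.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimator reduces to c / n, so when a method produces a single program per problem, Pass@1 is simply the fraction of problems whose program passes every hidden test.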
Abstract
The strong performance of large language models (LLMs) has prompted extensive discussion of their application to code generation. Recent research suggests continuously refining programs against visible tests to improve the code generation accuracy of LLMs. However, these methods suffer from LLMs' inefficiency and limited reasoning capacity. In this work, we propose an LLM programming workflow (LPW) designed to improve both initial code generation and subsequent refinements within a structured two-phase workflow. Specifically, the solution generation phase formulates a solution plan, which is then verified through visible tests to specify the intended natural language solution. Subsequently, the code implementation phase drafts initial code according to the solution plan and its verification. If the generated code fails the visible tests, the plan verification serves as the intended solution to…
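The two phases described in the abstract can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Hooks` container, its five callables, `failing_tests`, `lpw_sketch`, and the `max_rounds` budget are all hypothetical names, and the LLM calls are left as injected functions. Passing the verification text to the repair step is one reading of the truncated last sentence of the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Test = Tuple[tuple, object]  # (positional args, expected output)


@dataclass
class Hooks:
    """Hypothetical LLM-backed callables; none of these names come from the paper."""
    draft_plan: Callable[[str], str]
    # Returns a verification text, or None if the plan fails a visible test.
    verify_plan: Callable[[str, List[Test]], Optional[str]]
    revise_plan: Callable[[str, List[Test]], str]
    write_code: Callable[[str, str], Callable[..., object]]
    repair_code: Callable[[Callable[..., object], str, List[Test]], Callable[..., object]]


def failing_tests(program: Callable[..., object], tests: List[Test]) -> List[Test]:
    """Run the program on each visible test and collect the ones it fails."""
    failed = []
    for args, expected in tests:
        try:
            ok = program(*args) == expected
        except Exception:
            ok = False  # a crash counts as a failed test
        if not ok:
            failed.append((args, expected))
    return failed


def lpw_sketch(problem: str, tests: List[Test], hooks: Hooks, max_rounds: int = 3):
    # Phase 1 (solution generation): draft a plan, then verify it on visible tests.
    plan = hooks.draft_plan(problem)
    verification = hooks.verify_plan(plan, tests)
    for _ in range(max_rounds):
        if verification is not None:
            break
        plan = hooks.revise_plan(plan, tests)
        verification = hooks.verify_plan(plan, tests)
    if verification is None:
        return None  # no verified plan within the budget

    # Phase 2 (code implementation): draft code from the plan and its verification,
    # then refine while visible tests fail (assumed reading of the abstract).
    program = hooks.write_code(plan, verification)
    for _ in range(max_rounds):
        failed = failing_tests(program, tests)
        if not failed:
            return program
        program = hooks.repair_code(program, verification, failed)
    return program if not failing_tests(program, tests) else None
```

Keeping the LLM behind plain callables makes the control flow testable with stubs, for example a `write_code` stub that returns a buggy function and a `repair_code` stub that returns a fixed one.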
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
## Novelty
The paper successfully integrates two established methods, Self-Planning [2] and Self-Debugging [3], into the LPW framework. In addition, LPW introduces a novel "Plan Verification & Refinement" stage to iteratively refine the quality of the plan.
## Effectiveness
The LPW framework demonstrates substantial improvements over previous agentic frameworks like Self-Planning [2], Self-Debugging [3], and LDB [4].
[1] [Parsel: Algorithmic Reasoning with Language Models by Composing
## Novelty
Most design choices in LPW have been established and analyzed in previous works: the 1st stage "Plan Generation" is drawn from Self-Planning [2], and the 3rd stage "Iterative implementation & Refinement" follows various prior works [3, 4, 5], especially regarding the feedback mechanisms. The main novel contribution of LPW lies in its 2nd phase, "Plan Verification & Refinement," making the paper’s overall novelty contingent on the soundness and effectiveness of this one component.
1. This paper presents a well-designed framework. The technical details are solid and easy to reproduce.
2. The paper is clearly presented and illustrates the methods well.
3. The authors conduct a series of sound experiments.
1. The key point of LPW is the plan proposal and refinement based on visible tests. However, I'm not fully convinced about the refinement capability of plans, since it depends entirely on LLMs. The authors ask the LLM to dry run the plan on the test input, which in essence is a reasoning problem. It would require the LLM to reason well enough to dry run the operations, obtain the 'real output', and compare it with the 'grounded output'. Otherwise, it could derive wrong results e
This is an important research area, and projects such as this may have a big impact for practitioners. I believe this paper also includes some novel and interesting ideas, such as comparing the plan to the execution trace during debugging (to help the LLM reason about what went wrong), as well as (in the "sampling" version, SLPW) using a technique from the bandit literature (UCB) to select the set of plans to attempt to implement based on the number of unit tests that the plan is consistent with
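The plan-selection idea mentioned in this review can be illustrated with the standard UCB1 rule from the bandit literature. This is a sketch under assumptions, not the SLPW procedure itself: each candidate plan is treated as an arm, `mean_reward[i]` is assumed to be the average fraction of visible tests that plan i has been consistent with so far, `pulls[i]` counts how often plan i has been tried, and the name `ucb1_pick` is hypothetical.

```python
import math
from typing import List


def ucb1_pick(mean_reward: List[float], pulls: List[int], c: float = math.sqrt(2)) -> int:
    """Return the index of the plan to try next under the standard UCB1 rule.

    Score = mean reward + c * sqrt(ln(total pulls) / pulls of this arm).
    Any plan that has never been tried is picked first.
    """
    if not mean_reward or len(mean_reward) != len(pulls):
        raise ValueError("mean_reward and pulls must be non-empty and equal in length")
    for i, n in enumerate(pulls):
        if n == 0:
            return i  # explore untried plans before comparing scores
    total = sum(pulls)
    scores = [m + c * math.sqrt(math.log(total) / n) for m, n in zip(mean_reward, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The exploration bonus shrinks as a plan is tried more often, so a plan with a slightly lower observed test-consistency rate can still be selected when it has been attempted far fewer times than its competitors.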
Unfortunately, I was gravely disappointed by the analysis of the experiments in this paper. In their current form, I do not believe they are meaningful in any real sense - i.e., I don't think they accurately assess how this approach compares to simple baselines like just sampling IID from the model with a higher temperature. Reading lines 338-345 I believe anyone familiar with the code generation literature would immediately realize that the comparisons to the other methods will be unfair. In p
+ Well-motivated technique
+ Improved results of baselines
- Choice of baselines: I am a little baffled by the choice of baselines. The techniques compared against are mostly targeted at debugging, not code generation. Why not pick a few of the plethora of other code generation techniques, such as AgentCoder, MapCoder, CodeChain, or WizardCoder? If there is a specific reason for this choice, it needs to be discussed in detail in the paper.
- Choice of benchmarks: The main evaluation is done on a very simple set of benchmarks. The MBPP and HumanEva
Taxonomy
Topics: Model-Driven Software Engineering Techniques
