From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, Xiaodong Gu

TL;DR
This paper presents MGDebugger, a hierarchical debugging system that decomposes code into subfunctions to identify and fix errors at multiple levels, significantly improving code correctness in language model-generated programs.
Contribution
Introduces a novel hierarchical debugging approach with an LLM-simulated executor to improve bug detection and fixing across multiple granularity levels in generated code.
Findings
Achieves 18.9% higher accuracy than the seed generations on HumanEval
Reaches 97.6% repair success rate in HumanEvalFix
Effectively handles bugs of various categories and difficulty levels
Abstract
While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked on subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each…
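The bottom-up traversal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `SubFunction`, `debug_bottom_up`, and `toy_fix` are hypothetical names, and the repair step is a trivial string replacement standing in for the LLM-driven debugging that the paper describes. The sketch only shows the ordering property, namely that every subfunction is repaired before the function that calls it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubFunction:
    """One node in a hypothetical subfunction tree (illustrative only)."""
    name: str
    source: str
    children: List["SubFunction"] = field(default_factory=list)


def debug_bottom_up(node: SubFunction,
                    fix: Callable[[SubFunction], str],
                    order: List[str]) -> None:
    """Post-order traversal: repair every child before its parent."""
    for child in node.children:
        debug_bottom_up(child, fix, order)
    # Assumption: `fix` stands in for the LLM-based repair of one subfunction.
    node.source = fix(node)
    order.append(node.name)


def toy_fix(node: SubFunction) -> str:
    # Stand-in "repair": a real system would query a model here.
    return node.source.replace("retrun", "return")


tree = SubFunction(
    "solve",
    "def solve(xs): retrun total(clean(xs))",
    children=[
        SubFunction("clean", "def clean(xs): retrun [x for x in xs if x]"),
        SubFunction("total", "def total(xs): return sum(xs)"),
    ],
)

visited: List[str] = []
debug_bottom_up(tree, toy_fix, visited)
print(visited)  # ['clean', 'total', 'solve']
```

Visiting the leaves first means that by the time a parent function is examined, the helpers it depends on have already been checked, so any remaining failure is more likely to lie in the parent's own logic.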
Peer Reviews
Decision·Submitted to ICLR 2025
The paper tackles the problem of code generation using a multi-step approach. This is a timely contribution given the popularity of multi-step approaches and code generation. The paper is well written and organized. The individual components of the approach as well as using multi-step approaches for code generation are common; however, in the proposed approach various such components come together in a novel fashion.
Overall, I think the evaluations could be stronger. - This is especially important given the stochasticity of LLMs and the even greater variability of multi-step approaches. The results lack error bars. - The evaluations use three datasets, all of which are Python-based and composed mainly of basic problems. It would be great to see more diverse benchmarks to better convey the limitations of the proposed approach. You can consider adding the MultiPL-
MGDebugger offers a structured, hierarchical approach that isolates bugs at multiple granularity levels, an advance over monolithic debugging methods. It outperforms existing techniques like Reflexion and Self-Debugging on benchmarks such as HumanEval, providing a clearer, systematic debugging process. The methodology and experimental results are well-presented, showcasing MGDebugger’s potential to improve reliability in LLM-generated code.
MGDebugger’s novelty is somewhat limited, as its approach is similar to **[1]**. The evaluation could be broadened with more datasets, such as SWE-bench, and by including mainstream models such as LLaMA 3.1 and CodeLlama to better gauge generalizability and effectiveness in diverse debugging scenarios. **References** **[1]** Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. *Agentless: Demystifying LLM-based Software Engineering Agents*. arXiv preprint arXiv:2407.01489, 2024.
1. This paper presents a promising insight: LLMs can debug programs as modular functions and resolve program errors level by level. 2. The idea is presented clearly and the experimental results show significant improvements. 3. The authors conduct extensive experiments on ablations and on debugging improvements compared to existing methods.
1. The paper only evaluates open-source code models, which are possibly not good at generating natural language explanations. It would be great if the paper could conduct experiments on mainstream closed-source models (for example GPT-4, Claude2) and open-source models with good natural language capabilities (for example Llama-3.1). 2. The paper proposes methods that depend heavily on LLMs' reasoning and language analysis capabilities, which I have concerns about. First, the paper uses LLMs to de
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Software Engineering Research · Advanced Software Engineering Methodologies
