Teaching Large Language Models to Self-Debug

Xinyun Chen; Maxwell Lin; Nathanael Schärli; Denny Zhou

arXiv:2304.05128 · cs.CL · October 6, 2023 · 72 cites

TL;DR

This paper introduces Self-Debugging, a method enabling large language models to identify and correct their own code errors through natural language explanations, significantly improving code generation accuracy across multiple benchmarks.

Contribution

It presents a novel self-debugging approach that teaches LLMs to perform rubber duck debugging without human feedback, enhancing code correctness and sample efficiency.

Findings

1. Achieves state-of-the-art results on several code generation benchmarks.
2. Improves accuracy by 2-12%, depending on the dataset.
3. Improves sample efficiency, matching or outperforming baselines that sample more than 10× as many candidate programs.

Abstract

Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on the code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language. Self-Debugging achieves the state-of-the-art performance on several code generation benchmarks, including the Spider dataset for…
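The loop the abstract describes (predict a program, inspect its execution result, explain the code in natural language, then revise) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `self_debug` and `run_candidate` helpers, and the scripted stand-in for the language model are all hypothetical, and the paper's actual few-shot prompts are not reproduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: prompt text in, generated text out.
Model = Callable[[str], str]


def run_candidate(code: str, call: str) -> Tuple[bool, str]:
    """Run candidate code, then evaluate `call` against it.

    Returns (ran_without_exception, repr of result or error text).
    Note: exec/eval here are unsandboxed; real use would need isolation.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        return True, repr(eval(call, namespace))
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


def self_debug(model: Model, task: str, call: str,
               max_turns: int = 3) -> Tuple[str, List[str]]:
    """Generate code, then let the model judge and revise it by itself.

    No human tells the model whether the code is right: it only sees the
    execution result and its own explanation of the code.
    """
    code = model(f"WRITE code for this task:\n{task}")
    history: List[str] = []
    for _ in range(max_turns):
        history.append(code)
        ran, result = run_candidate(code, call)
        explanation = model(f"EXPLAIN this code:\n{code}")
        verdict = model(
            f"JUDGE whether the code solves the task.\nTask: {task}\n"
            f"Explanation: {explanation}\nResult of {call}: {result}\n"
            "Answer YES or NO."
        )
        if ran and verdict.strip().upper().startswith("YES"):
            break
        # The last revision is returned unverified if turns run out.
        code = model(
            f"FIX the code.\nTask: {task}\nCode:\n{code}\n"
            f"Explanation: {explanation}\nResult of {call}: {result}"
        )
    return code, history


def make_scripted_model() -> Model:
    """Offline stand-in for an LLM that returns canned replies."""
    buggy = "def add(a, b):\n    return a - b\n"
    fixed = "def add(a, b):\n    return a + b\n"

    def model(prompt: str) -> str:
        if prompt.startswith("WRITE"):
            return buggy
        if prompt.startswith("EXPLAIN"):
            if "a - b" in prompt:
                return "The function returns a - b."
            return "The function returns a + b."
        if prompt.startswith("JUDGE"):
            return "NO" if "a - b" in prompt else "YES"
        if prompt.startswith("FIX"):
            return fixed
        raise ValueError("unexpected prompt")

    return model


if __name__ == "__main__":
    final_code, attempts = self_debug(
        make_scripted_model(), "Return the sum of a and b.", "add(2, 3)"
    )
    print(f"attempts: {len(attempts)}")
    print(final_code)
```

In a real setting the scripted model would be replaced by calls to an actual LLM, and the fixed budget `max_turns` bounds the extra inference cost of each debugging round.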

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. Proposing a novel approach called SELF-DEBUGGING that enables a large language model to debug its own predicted program via few-shot demonstrations.
2. Demonstrating that SELF-DEBUGGING can teach the large language model to perform rubber duck debugging, i.e., identifying its mistakes by investigating the execution results and explaining the generated code in natural language.
3. Achieving state-of-the-art performance on several code generation benchmarks, including the Spider dataset for tex

Weaknesses

I do not find obvious weaknesses in this work. I only have a concern about the proposed approach. Since I'm not an expert in the code generation field, please correct me if I have some misunderstandings. This kind of "self-debug" or "self-refine" requires LLMs to inspect their own outputs based on the unit test results and generate some explanation in an autoregressive manner. So a concern is the additional latency in the inference time, especially for extremely large language models. This extra

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- Proposed Self-Debugging approach based on code self-explanation (rubber duck debugging) and test execution result investigation.
- Some improvement over baseline results.

Weaknesses

- Improvement over the baseline on the Spider benchmark is only 2-3%, which is not shown to be statistically significant. It could be an accidental result of a prompt change.
- Same issue with code explanation without debugging for TransCoder and MBPP.
- It seems that Self-Debugging without unit test execution yields very limited and possibly statistically insignificant improvements.

The following weaknesses have been fixed in the paper update by the authors:

- Section 4 is very hard to read. It constantly

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

The intuition and introduction of this paper are quite clear. The proposed method is simple, effective, and can be applied to any LLM. The method achieves promising performance improvements on code generation tasks.

Weaknesses

1. Although the introduction of this paper is clear, the methodology section is not. There are many components in the proposed approach: code execution, code explanation, inferring code correctness, etc. Figure 1 is helpful but still not clear enough. It would be better to include a diagram with a concrete example in the main text.
2. Although the paper claims that they improve sample efficiency, I am still doubtful about this as the self-debugging approach does require generating much more

Taxonomy

Topics: Software Engineering Research · Parallel Computing and Optimization Techniques · Topic Modeling

Methods: Repair