TL;DR
LETI introduces a method for fine-tuning language models using textual feedback from code execution errors, improving code generation without relying on ground-truth outputs and enhancing sample efficiency.
Contribution
The paper presents LETI, a novel approach that leverages textual error feedback for training language models, outperforming traditional methods that depend on input-output pairs or numerical rewards.
Findings
LETI improves code generation performance on the MBPP and HumanEval datasets.
Textual feedback enhances sample efficiency, reducing the number of training steps needed.
LETI generalizes to natural language tasks like event argument extraction.
Abstract
Fine-tuning pre-trained language models (LMs) is essential for enhancing their capabilities. Existing techniques commonly fine-tune on input-output pairs (e.g., instruction tuning) or with numerical rewards that gauge the output quality (e.g., RLHF). We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback. Our focus is the code generation task, where the model produces code based on natural language instructions. This setting invites a natural and scalable way to acquire textual feedback: the error messages and stack traces from code execution using a Python interpreter. LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback. Prepended to…
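The feedback loop described in the abstract can be illustrated with a minimal sketch. It assumes feedback comes from running a candidate program against a test with Python's built-in `exec` and capturing the stack trace. The helper names `run_with_feedback` and `build_training_text` are hypothetical, and the concatenation order simply follows the order listed in the abstract; the paper's actual layout, and whatever is prepended to the text, may differ.

```python
import traceback


def run_with_feedback(program: str, test_code: str) -> tuple[bool, str]:
    """Run a candidate program and its test; return (passed, textual feedback).

    Hypothetical helper for illustration. exec() on model output is unsafe
    outside a sandbox, and this sketch does not guard against infinite loops.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)    # define the candidate solution
        exec(test_code, namespace)  # run the test against it
    except Exception:
        # The error message and stack trace serve as the textual feedback.
        return False, traceback.format_exc()
    return True, "All tests passed."


def build_training_text(instruction: str, program: str, feedback: str) -> str:
    """Concatenate instruction, generated program, and feedback.

    Assumption: the order mirrors the abstract's wording, not a confirmed
    layout from the paper.
    """
    return "\n".join([instruction, program, feedback])


instruction = "Write a function add(a, b) that returns the sum of a and b."
buggy_program = "def add(a, b):\n    return a + c\n"  # bug: 'c' is undefined
fixed_program = "def add(a, b):\n    return a + b\n"
test_code = "assert add(1, 2) == 3"

buggy_passed, buggy_feedback = run_with_feedback(buggy_program, test_code)
fixed_passed, fixed_feedback = run_with_feedback(fixed_program, test_code)
training_text = build_training_text(instruction, buggy_program, buggy_feedback)
```

Here the buggy program fails with a `NameError` whose stack trace becomes part of the fine-tuning text, while the fixed program passes. A real pipeline would sample many programs per instruction from the LM and execute them in an isolated environment.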
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
Using execution environments' error messages to provide a good/bad feedback split for fine-tuning is a good strategy, and it exploits a readily available resource to the benefit of code generation models. In future work, such metadata could also be used to assess the severity of errors in the output, which could help differentiate multiple outputs sampled at different temperature settings. Any solution that helps bring smaller models up to par with larger models through smarter training is especially beneficial given the reso
I would love to see more programming languages handled in this work. It feels narrowly defined around Python problems and risks overfitting the solution to one specific interpreter's error messages, which might not replicate as we move to other languages or runtime environments. Execution-oriented solutions are, moreover, heavily dependent on the runtime environment and specific to it. Otherwise you are limited to syntactic checkers or basic execution approaches which might
1) The paper is well motivated, and focuses on how iterative refinement is an important aspect of producing high-quality, correct code solutions. Improving via execution feedback is very interesting and this paper proposes a detailed solution for this. 2) The paper shows relevant gains of LETI on the MBPP test set, over a baseline pretrained model, breaking down the different ways in which it corrects errors such as SyntaxError, NameErrors. 3) The improvement in sample efficiency using textua
1) Results for generalization and robustness do not back the claims adequately. a. In Table 2, the improvement on HumanEval is mixed for pass@1, with the pretrained model doing better than LETI for 2B. This could be attributed to error (given the small size of HumanEval), so averaging over different runs/seeds would be preferable. Also, if this could be tried on Spider/other code generation tasks this case could be made better. b. In Table 5, the pretrained 2B model performs better
S1: The training data can be collected automatically (i.e., no human-in-loop) in an iterative manner without the need for gold-standard solutions, which in principle can easily scale to larger datasets and problem sets. I think such bootstrapping methods are important for further improving LLMs given their data-hungry nature; S2: The way to construct each training example is quite interesting. Instead of training the models to predict the error message from buggy programs, LETI reverses the or
W1: The major weakness of this work is the soundness of the experiments. More specifically: * W1.1: Using *TheStack* as part of the "pretraining dataset". As the authors noted in footnote 4, CodeGen-mono is trained on BigPython (actually it was first trained on the Pile and then BigQuery, then BigPython), and TheStack might contain a substantial amount of code that CodeGen-mono models have never seen before. This could contribute to the performance improvements that are perceived as the effect o
1. The paper presents comprehensive experiments and evaluations on different datasets and tasks and is well-written and clearly structured. 2. The proposed LETI paradigm holds potential for improving LMs' capabilities in various tasks. Its ability to leverage textual feedback for better generation quality and sample efficiency highlights its practical significance, and its successful application to both programming language and natural language tasks suggests that this paradigm can be extended to
1. Evaluation on larger models is needed to provide insights into its scalability and effectiveness. 2. Investigating the impact of different solution evaluator designs on LETI's performance would be informative, as biases may be introduced when optimizing towards certain metrics. 3. Evaluating LETI's effectiveness in other domains and tasks would further validate its generalizability. 4. Comparing LETI with other RL-based approaches that leverage rewards or value functions would help establish
1. The paper studies an interesting problem of learning interactively from environmental feedback, without needing ground-truth annotations. 2. The experiments are generally solid and comprehensive. When it is applied to a 2B CodeGen LM, LETI improves the LM to even outperform the traditional, human-annotation-required, fine-tuned baseline. This is then supported by additional analyses confirming the advantage (performance and sample efficiency) of textual feedback compared with using only the b
1. I don't see any significant weaknesses in the proposed approach, but there are a few questions on which I would like the authors' clarification. See Questions. 2. Missing references. There have been many more works about "using feedback to improve code generation", e.g., the following and their follow-ups or referred papers. - Elgohary, A., Hosseini, S., & Awadallah, A. H. (2020, July). Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback. In Proceedings of the 58t
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
Methods: Balanced Selection
