Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation

Eric Zelikman; Eliana Lorch; Lester Mackey; Adam Tauman Kalai

arXiv:2310.02304·cs.CL·August 19, 2024·2 cites

TL;DR

This paper introduces a method in which a scaffolding program that calls a language model iteratively improves its own code. With GPT-4 as the model, the program gets better while the model itself is left unaltered.

Contribution

The work presents a novel approach for self-improving code generation using language models, showing that GPT-4 can create code that iteratively enhances its own performance.

Findings

01 Improved improvers generate better-performing programs than the seed improver on downstream tasks

02 The language model proposes a variety of self-improvement strategies, including beam search, genetic algorithms, and simulated annealing

03 Generated code sometimes bypasses sandbox restrictions

Abstract

Several recent advances in AI systems solve problems by providing a "scaffolding" program that structures multiple calls to language models (LMs) to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a language-model-infused scaffolding program to improve itself. We start with a seed "improver" that improves an input program according to a given utility function by querying an LM several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than its seed improver. A variety of self-improvement strategies are proposed by the language model, including beam search, genetic algorithms, and simulated annealing. Since the language models themselves are not…
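The seed improver described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: `StubLanguageModel`, `propose`, `seed_improver`, and `toy_utility` are hypothetical names, the stub returns canned programs instead of calling a real LM, and keeping the input program in the candidate pool is an assumption of this sketch.

```python
from typing import Callable, List


class StubLanguageModel:
    """Hypothetical stand-in for an LM; it returns canned candidate programs."""

    def __init__(self, canned_programs: List[str]) -> None:
        self.canned_programs = canned_programs

    def propose(self, program: str, n: int) -> List[str]:
        # A real LM would be prompted with `program`; the stub ignores it.
        return self.canned_programs[:n]


def seed_improver(
    program: str,
    utility: Callable[[str], float],
    lm: StubLanguageModel,
    n_queries: int = 4,
) -> str:
    """Ask the LM for several candidates and return the highest-utility one."""
    candidates = lm.propose(program, n_queries)
    # Assumption: the input stays in the pool, so the result is never worse.
    return max([program] + candidates, key=utility)


def toy_utility(program: str) -> float:
    """Return 1.0 if the program defines f with f(4) == 16, else 0.0."""
    namespace: dict = {}
    try:
        exec(program, namespace)
        return 1.0 if namespace["f"](4) == 16 else 0.0
    except Exception:
        # Programs that fail to parse or run get the lowest score.
        return 0.0


initial = "def f(x):\n    return x + x\n"
lm = StubLanguageModel([
    "def f(x):\n    return x * 3\n",  # runs, but wrong answer
    "def f(x):\n    return x * x\n",  # correct
    "def f(x) return x\n",            # syntax error
])
best = seed_improver(initial, toy_utility, lm)
print(best)
```

In this framing, self-improvement would correspond to passing the improver's own source code as `program`, with a utility that scores an improver by how well it improves programs on downstream tasks.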

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

S1: The proposed method is interesting and the idea of having a formulation where the improver can improve over itself given a utility function is very cool.
S2: A clear discussion of the limitations and concerns for STOP is presented, which is very helpful.
S3: A number of (i.e., five) tasks are considered in the experiments, and the proposed methods yield non-trivial improvements on all of them.

Weaknesses

W1: My main concern about this work is that some important results needed to understand how well the proposed method works are missing. More concretely:
* How much improvement each iteration made. For example, it was mentioned that the improvements may not be monotonic (which implies non-greedy, global optimization). Such improvement curves would be very interesting to look at, but only one example is shown in Fig 4.
* Quantitatively, the differences between the results with GPT-4 a

Reviewer 02 · Rating 8 · accept, good paper · Confidence 4

Strengths

The key strengths of this research are, first and foremost, the demonstration of a proof-of-concept for using LLMs for self-improvement and meta-learning. Second, the strength of the paper is that LLMs can optimize code which includes the model itself. This approach demonstrated by LLM is similar to evolutionary algorithms without any exposure in the training data. Third, the impact of this research is profound, because it demonstrates that LLMs are able to self-improve themselves in contras

Weaknesses

The base LLM has a huge importance on STOP's performance (Section 5.3)

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

+ Originality: this paper is the first to propose a meta-learning framework for optimizing code scaffolding, aiming at better performance on downstream tasks.
+ Significance: this paper takes a new and important perspective in revealing the power and potential misuse of large language models by querying the model to optimize the meta-heuristic for solving code tasks. Though their current framework is not general enough to cover all tasks, their observation about reward hacking and sandbox circumventi

Weaknesses

- Missing comparison with two types of baselines. The first is human-designed prompt structures such as Chain-of-Thought and Program of Thoughts: which is better, the prompt structure found by their meta-learning paradigm or these human-crafted ones? The second is heuristics for coding, such as genetic algorithms. My question is whether, given a downstream task, STOP can find better meta-heuristics than the common ones.

Code & Models

Repositories

microsoft/stop · Official

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Machine Learning and Data Classification · Software Testing and Debugging Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Dense Connections · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization