Large Language Models as Optimizers
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen

TL;DR
This paper introduces OPRO, a method that uses large language models as optimizers: the LLM is prompted with previously generated solutions and their scores and asked to propose better ones. Applied to prompt optimization, it finds instructions that outperform human-designed prompts on GSM8K and Big-Bench Hard.
Contribution
The paper presents Optimization by PROmpting (OPRO), an approach that uses LLMs as optimizers for problems where gradients are unavailable, and shows that the prompts it finds outperform human-designed prompts.
Findings
OPRO outperforms human-designed prompts by up to 8% on GSM8K.
OPRO improves task accuracy by up to 50% on Big-Bench Hard.
OPRO is also showcased on linear regression and traveling salesman problems, in addition to prompt optimization.
Abstract
Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to our main application in prompt optimization, where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts…
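The optimization step described in the abstract can be sketched on the linear regression case as follows. This is a minimal illustration, not the authors' implementation: `stub_proposer` is a hypothetical stand-in for the LLM call (it simply perturbs the best solution so far), and the meta-prompt wording, the `top_k` cutoff, and the worst-to-best ordering are assumptions made for the example.

```python
import random
from typing import List, Tuple

Solution = Tuple[float, float]          # (w, b) for the model y = w * x + b
Scored = Tuple[Solution, float]         # a solution paired with its loss


def loss(sol: Solution, data: List[Tuple[float, float]]) -> float:
    """Sum of squared errors of y = w * x + b on the data (lower is better)."""
    w, b = sol
    return sum((w * x + b - y) ** 2 for x, y in data)


def build_meta_prompt(history: List[Scored], top_k: int = 5) -> str:
    """Describe the task in natural language and list the best past solutions.

    Assumption: only the top_k lowest-loss solutions are shown, worst first.
    """
    best = sorted(history, key=lambda p: p[1])[:top_k]
    lines = ["Below are (w, b) pairs with their loss; lower is better."]
    for (w, b), value in reversed(best):
        lines.append(f"w={w:.2f}, b={b:.2f}, loss={value:.2f}")
    lines.append("Propose a new (w, b) pair with a lower loss.")
    return "\n".join(lines)


def stub_proposer(prompt: str, history: List[Scored],
                  rng: random.Random, n: int = 4) -> List[Solution]:
    """Hypothetical stand-in for an LLM call.

    A real LLM would see only `prompt`; this stub ignores it and perturbs
    the best solution found so far, just so the loop can run offline.
    """
    (w, b), _ = min(history, key=lambda p: p[1])
    return [(w + rng.gauss(0, 1), b + rng.gauss(0, 1)) for _ in range(n)]


def opro_loop(data: List[Tuple[float, float]], steps: int = 30,
              seed: int = 0) -> Tuple[Scored, List[Scored]]:
    """Run the propose / evaluate / add-to-prompt cycle for `steps` rounds."""
    rng = random.Random(seed)
    init = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    history: List[Scored] = [(init, loss(init, data))]
    for _ in range(steps):
        prompt = build_meta_prompt(history)
        for candidate in stub_proposer(prompt, history, rng):
            # Evaluate each new solution and record it for the next prompt.
            history.append((candidate, loss(candidate, data)))
    return min(history, key=lambda p: p[1]), history


if __name__ == "__main__":
    points = [(float(x), 3.0 * x + 2.0) for x in range(-5, 6)]
    (best_w, best_b), best_loss = opro_loop(points)[0]
    print(f"best w={best_w:.2f}, b={best_b:.2f}, loss={best_loss:.2f}")
```

Replacing `stub_proposer` with a call to an actual model that parses `(w, b)` pairs out of the generated text would turn this into the setup the abstract describes; for prompt optimization, the solutions would be instruction strings and the score would be task accuracy.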
Peer Reviews
Decision: ICLR 2024 poster
This work demonstrates that LLMs can help optimize prompts to achieve high performance on a variety of tasks.
First, I disagree with the authors fundamentally about what optimization means. To me, this work is not optimization but step-by-step inference. To quote Wikipedia for reference, optimization is "the selection of a best element, with regard to some criterion, from some set of available alternatives." One can plausibly consider the process of prompt selection as "optimization", but in order to make a claim on the general area of optimization I would expect results on optimizing a wide range of co
- Novel idea of leveraging LLMs' understanding of natural language and few-shot learning abilities for optimization. Enables optimization by simply describing the problem rather than formal specification.
- Demonstrated on diverse tasks: mathematical optimization, prompt optimization. Shows potential breadth of this approach.
- Compelling results on prompt optimization. Optimized prompts substantially outperform human-written prompts, improving accuracy by up to 8% on GSM8K and 50% on BigBenc
- The biggest limitation is that OPRO's performance looks highly fluctuating. It's unclear if the LLM really finds the so-called optimization "trajectory" or just randomly finds a good prompt. The authors should provide more analysis to show that the LLM is indeed learning to optimize.
- Limited exploration on how to provide richer feedback to LLM beyond aggregated scores. It could help address limitations.
- Unclear how sensitive results are to meta-prompt design and hyperparameters like tem
1. Good proof of concept. This paper provides concrete evidence that large language models can find the patterns between the inputs and corresponding scores that humans might not be able to find to conduct optimization tasks.
2. Good use case. Based on such proof of concept, this paper finds a valid use case for the proposed method, on which other traditional optimizations might be difficult to apply, finding the proper prompts for LLM tasks.
3. Solid experiments on prompt search. Experiments sh
Two questions on the ablation study.
1. Number of exemplars. Did you take the randomness of exemplar selection into consideration? For each run of every setting, do you give the same set of exemplars?
2. I noticed that for different tasks, the "batch size" that works best can be different (Figure 5, cd). Do you find any obvious patterns in which types of data/tasks prefer a smaller "batch size" and vice versa?
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Linear Regression
