Large Language Models as Optimizers

Chengrun Yang; Xuezhi Wang; Yifeng Lu; Hanxiao Liu; Quoc V. Le; Denny Zhou; Xinyun Chen

arXiv:2309.03409·cs.LG·April 16, 2024·92 cites

TL;DR

This paper introduces OPRO, a method that uses large language models as optimizers by prompting them with previously generated solutions and their scores. Applied to prompt optimization, it finds instructions that outperform human-designed prompts on GSM8K and Big-Bench Hard.

Contribution

The paper presents a new approach called Optimization by PROmpting (OPRO) that leverages LLMs as optimizers for problems where gradients are unavailable, and shows that the prompts it finds outperform human-designed prompts.

Findings

01

OPRO outperforms human-designed prompts by up to 8% on GSM8K.

02

OPRO-optimized prompts outperform human-designed prompts by up to 50% on Big-Bench Hard tasks.

03

Demonstrates effectiveness of LLMs as optimizers across multiple tasks.

Abstract

Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to our main application in prompt optimization, where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts…
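The optimization step described in the abstract can be sketched roughly as follows, using the linear regression setting as a toy problem. This is a minimal illustration, not the authors' implementation: `build_meta_prompt`, `stand_in_llm`, and `opro` are hypothetical names, the meta-prompt wording and the ordering of past solutions are assumptions, and `stand_in_llm` is a random-perturbation placeholder for a real LLM call so that the loop runs without any model access.

```python
import random
import re
from typing import Callable, List, Tuple

Solution = Tuple[float, float]  # (w, b) for the model y = w * x + b
Scored = Tuple[Solution, float]  # a solution paired with its loss


def build_meta_prompt(history: List[Scored], top_k: int = 5) -> str:
    """Describe the task in natural language and list past solutions with their values.

    Keeps the top_k lowest-loss solutions and lists them worst to best
    (this ordering and the wording are assumptions made for the sketch).
    """
    best = sorted(history, key=lambda item: item[1])[:top_k]
    lines = ["Minimize the loss of y = w * x + b. Previous (w, b) pairs and losses:"]
    for (w, b), loss in reversed(best):
        lines.append(f"w={w:.2f}, b={b:.2f}, loss={loss:.2f}")
    lines.append("Propose a new (w, b) pair with a lower loss.")
    return "\n".join(lines)


def stand_in_llm(meta_prompt: str, rng: random.Random, n: int = 4) -> List[Solution]:
    """HYPOTHETICAL placeholder for an LLM call.

    Reads the last (best) pair listed in the prompt and perturbs it randomly.
    A real setup would send meta_prompt to a model and parse its reply.
    """
    pairs = re.findall(r"w=(-?\d+\.\d+), b=(-?\d+\.\d+)", meta_prompt)
    w, b = map(float, pairs[-1])
    return [(w + rng.gauss(0, 1), b + rng.gauss(0, 1)) for _ in range(n)]


def opro(
    evaluate: Callable[[Solution], float],
    propose: Callable[[str], List[Solution]],
    initial: List[Solution],
    steps: int = 50,
) -> Scored:
    """Run the propose / evaluate / append loop and return the best scored solution."""
    history: List[Scored] = [(s, evaluate(s)) for s in initial]
    for _ in range(steps):
        prompt = build_meta_prompt(history)
        for candidate in propose(prompt):
            # New solutions are evaluated and added to the pool for the next step.
            history.append((candidate, evaluate(candidate)))
    return min(history, key=lambda item: item[1])


# Toy data generated from y = 3x + 2 (illustrative values only).
rng = random.Random(0)
xs = [float(i) for i in range(10)]
ys = [3.0 * x + 2.0 for x in xs]


def loss(solution: Solution) -> float:
    """Sum of squared errors of y = w * x + b on the toy data."""
    w, b = solution
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


initial = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(3)]
best, best_loss = opro(loss, lambda prompt: stand_in_llm(prompt, rng), initial)
print(best, best_loss)
```

Because the proposer is passed in as a function that takes only the meta-prompt text, swapping the placeholder for a real model call would not change the loop itself. For prompt optimization, the same structure would apply with instructions as solutions and task accuracy as the value to maximize.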

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

This work demonstrates that LLMs can help optimize prompts to achieve high performance on a variety of tasks.

Weaknesses

First, I disagree with the authors fundamentally about what optimization means. To me, this work is not optimization but step-by-step inference. To quote Wikipedia for reference, optimization is "the selection of a best element, with regard to some criterion, from some set of available alternatives." One can plausibly consider the process of prompt selection as "optimization", but in order to make a claim on the general area of optimization I would expect results on optimizing a wide range of co

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- Novel idea of leveraging LLMs' understanding of natural language and few-shot learning abilities for optimization. Enables optimization by simply describing the problem rather than formal specification.
- Demonstrated on diverse tasks - mathematical optimization, prompt optimization. Shows potential breadth of this approach.
- Compelling results on prompt optimization. Optimized prompts substantially outperform human-written prompts, improving accuracy by up to 8% on GSM8K and 50% on BigBenc

Weaknesses

- The biggest limitation is that OPRO's performance looks highly fluctuating. It's unclear if the LLM really finds the so-called optimization "trajectory" or just randomly finds a good prompt. The authors should provide more analysis to show that the LLM is indeed learning to optimize.
- Limited exploration on how to provide richer feedback to LLM beyond aggregated scores. It could help address limitations.
- Unclear how sensitive results are to meta-prompt design and hyperparameters like tem

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. Good proof of concept. This paper provides concrete evidence that large language models can find the patterns between the inputs and corresponding scores that humans might not be able to find to conduct optimization tasks.
2. Good use case. Based on such proof of concept, this paper finds a valid use case for the proposed method, on which other traditional optimizations might be difficult to apply, finding the proper prompts for LLM tasks.
3. Solid experiments on prompt search. Experiments sh

Weaknesses

Two questions on the ablation study.
1. Number of exemplars. Did you take the randomness of example picking into consideration? For each run of every setting, do you give the same set of examples?
2. I noticed that for different tasks, the “batch size” that works the best can be different (Figure 5, cd). Do you find any obvious patterns on which types of data/tasks prefer a smaller “batch size” and vice versa?

Code & Models

Videos

9 AI Developments: HeyGen 2.0 to AjaxGPT, Open Interpreter to NExT-GPT and Roblox AI · YouTube

Large Language Models as Optimizers · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Linear Regression