Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

Chrisantha Fernando; Dylan Banarse; Henryk Michalewski; Simon Osindero; Tim Rocktäschel

arXiv:2309.16797·cs.CL·October 2, 2023·27 cites

TL;DR

Promptbreeder is a novel self-referential system that evolves and improves prompts for large language models, outperforming existing strategies on reasoning and classification benchmarks by self-adapting its mutation mechanisms.

Contribution

It introduces a self-referential prompt evolution method that enhances prompt quality and domain adaptation for LLMs, surpassing prior prompt strategies.

Findings

01. Outperforms Chain-of-Thought and Plan-and-Solve prompting on reasoning benchmarks.

02. Effectively evolves prompts for hate speech classification.

03. Demonstrates self-improving mutation mechanisms for prompt optimization.

Abstract

Popular prompt strategies like Chain-of-Thought Prompting can dramatically improve the reasoning abilities of Large Language Models (LLMs) in various domains. However, such hand-crafted prompt strategies are often sub-optimal. In this paper, we present Promptbreeder, a general-purpose self-referential self-improvement mechanism that evolves and adapts prompts for a given domain. Driven by an LLM, Promptbreeder mutates a population of task-prompts, and subsequently evaluates them for fitness on a training set. Crucially, the mutation of these task-prompts is governed by mutation-prompts that the LLM generates and improves throughout evolution in a self-referential way. That is, Promptbreeder is not just improving task-prompts, but it is also improving the mutation-prompts that improve these task-prompts. Promptbreeder outperforms state-of-the-art prompt strategies such as Chain-of-Thought…
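The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the LLM is replaced by a deterministic stub (`toy_llm`), fitness is a toy keyword count rather than accuracy on a training set, and the binary-tournament selection, the `hyper_rate` parameter, and all names (`Unit`, `evolve`, `HYPER_PREFIX`, `MARKER`) are hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical prompt scaffolding used only by this sketch.
HYPER_PREFIX = "Improve this rewriting instruction: "
MARKER = " INSTRUCTION: "


@dataclass
class Unit:
    """One member of the population: a task-prompt plus the mutation-prompt that edits it."""
    task_prompt: str
    mutation_prompt: str


def toy_llm(text: str) -> str:
    """Deterministic stand-in for an LLM call (assumption: a real system would query a model)."""
    if text.startswith(HYPER_PREFIX):
        # "Improve" a mutation-prompt by appending a fixed word.
        return text[len(HYPER_PREFIX):] + " carefully"
    mutation_prompt, task_prompt = text.split(MARKER, 1)
    # "Apply" the mutation-prompt by appending its last word to the task-prompt.
    return task_prompt + " " + mutation_prompt.split()[-1]


def toy_fitness(task_prompt: str) -> int:
    """Toy score (assumption): a real fitness would be accuracy on a training set."""
    return len(set(task_prompt.lower().split()) & {"step", "verify", "carefully"})


def evolve(population: List[Unit], llm: Callable[[str], str],
           fitness: Callable[[str], int], generations: int,
           rng: random.Random, hyper_rate: float = 0.2) -> List[Unit]:
    pop = list(population)
    for _ in range(generations):
        # Binary tournament (an assumed selection scheme): pick two units, keep the fitter.
        i, j = rng.sample(range(len(pop)), 2)
        if fitness(pop[i].task_prompt) < fitness(pop[j].task_prompt):
            i, j = j, i  # i is now the winner, j the loser
        winner = pop[i]
        mutation_prompt = winner.mutation_prompt
        if rng.random() < hyper_rate:
            # Self-referential step: the LLM rewrites the mutation-prompt itself.
            mutation_prompt = llm(HYPER_PREFIX + mutation_prompt)
        # The (possibly rewritten) mutation-prompt governs how the task-prompt changes.
        new_task = llm(mutation_prompt + MARKER + winner.task_prompt)
        pop[j] = Unit(new_task, mutation_prompt)  # the loser is overwritten
    return pop


start = [
    Unit("Solve the problem", "Rewrite the instruction and add the word step"),
    Unit("Answer the question", "Rewrite the instruction and add the word verify"),
    Unit("Work it out", "Rewrite the instruction and add the word step"),
]
final = evolve(start, toy_llm, toy_fitness, generations=20, rng=random.Random(0))
best = max(final, key=lambda u: toy_fitness(u.task_prompt))
print(best.task_prompt)
```

Because the loser of each tournament is overwritten while the winner stays in place, the best fitness in the population cannot decrease in this sketch. Swapping `toy_llm` and `toy_fitness` for a real model call and a held-out evaluation would be the natural next step, at which point evaluation cost and noise become the main design concerns.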

Peer Reviews

Decision·ICML 2024 Poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1: How to generate an appropriate LLM prompt for the target downstream task is indeed a very important research question that still lacks a solid answer. I agree with the authors that LLMs are qualitatively different from other deep learning models, as they have the potential to self-improve their thinking process (e.g. by self-generating prompts).

2: From the point of view of genetic algorithms, the authors propose a comprehensive set of mutation strategies, which includes a certain extent of "self-referential"/self

Weaknesses

1: My major concern is whether an evolutionary algorithm framework is, in general, capable enough for the large prompt space of LLMs. Overall, the examples provided by the authors in Figure 3 appear to show little difference between prompts, which might suggest an under-explored prompt space. Personally, I feel that some level of learning/gradient signal is needed to better explore and generate prompts for complex LLMs. It would be very interesting (but not necessary) to have some compari

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 2

Strengths

- PROMPTBREEDER showed promising performance and outperformed competing baseline approaches in 7 out of 8 tasks by a large margin.

- The PROMPTBREEDER method was well elaborated in the manuscript and in the supplementary material.

Weaknesses

- PROMPTBREEDER aimed to concurrently refine both task and mutation prompts, considerably expanding the search space. However, the absence of navigation during each evaluation often resulted in unpredictable performance for the successive generated prompts. This is evidenced by the persistence of less effective prompts after extensive evaluations, as illustrated in Figure 3.

- In light of the above, PROMPTBREEDER appears to rely on an extensive series of trial-and-error iterations to identify an

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

This paper proposes a systematic framework to evolve domain-specific prompts, and shows better results compared to other prompt strategies.

Weaknesses

1. The experiments are not extensive. In Table 1, the compared LLMs do not include the most recognized models such as gpt-3.5 or gpt-4, and the compared methods should include CoT on PaLM 2-L.

2. The proposed method, Promptbreeder, still requires initial information for a specific task (such as a description or mutation prompts), and a worse initialization may lead to worse performance. This means the method may not generalize to various tasks.

Code & Models

Videos

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (Paper Explained) · YouTube

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection