DelvePO: Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization

Tao Tao; Guanghui Zhu; Lang Guo; Hongyi Chen; Chunfeng Yuan; Yihua Huang

arXiv:2510.18257·cs.CL·October 22, 2025

TL;DR

DelvePO is a flexible, self-evolving prompt optimization framework that decouples prompt components, uses working memory to guide prompt generation, and outperforms previous methods across diverse tasks and models.

Contribution

It introduces a task-agnostic, direction-guided self-evolving framework for prompt optimization that improves transferability and stability over existing approaches.

Findings

01. DelvePO outperforms SOTA methods on various tasks.

02. It demonstrates high transferability across different LLMs.

03. The framework effectively alleviates prompt instability.

Abstract

Prompt Optimization has emerged as a crucial approach due to its capabilities in steering Large Language Models to solve various tasks. However, current works mainly rely on the random rewriting ability of LLMs, and the optimization process generally focuses on specific influencing factors, which makes it easy to fall into a local optimum. Besides, the performance of the optimized prompt is often unstable, which limits its transferability across different tasks. To address the above challenges, we propose DelvePO (Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization), a task-agnostic framework to optimize prompts in a self-evolving manner. In our framework, we decouple prompts into different components that can be used to explore the impact that different factors may have on various tasks.…
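
The idea of decoupling a prompt into components and letting a working memory steer the next rewrite can be illustrated with a small sketch. This is not the authors' implementation: the component types (`role`, `task`, `output_format`), the `WorkingMemory` class, and the `apply_direction` helper are all hypothetical names chosen for illustration, and the scores are assumed to come from some external evaluation step that is not shown.

```python
from dataclasses import dataclass, field

# Hypothetical component types; the source does not list the actual ones.
COMPONENT_ORDER = ("role", "task", "output_format")


def render(components: dict[str, str]) -> str:
    """Join component values into one prompt string, in a fixed order."""
    return "\n".join(components[k] for k in COMPONENT_ORDER if k in components)


@dataclass
class WorkingMemory:
    # (component type, old value, new value) -> observed score change
    component_deltas: dict = field(default_factory=dict)
    # rendered prompt -> latest observed score
    prompt_scores: dict = field(default_factory=dict)

    def record(self, parent, child, parent_score, child_score):
        """Store prompt-level scores and the component-level change behind them."""
        self.prompt_scores[render(parent)] = parent_score
        self.prompt_scores[render(child)] = child_score
        for key in COMPONENT_ORDER:
            old, new = parent.get(key, ""), child.get(key, "")
            if old != new:
                # If several components differ, each is credited with the full delta.
                self.component_deltas[(key, old, new)] = child_score - parent_score

    def best_direction(self):
        """Return the recorded component swap with the largest gain, or None."""
        if not self.component_deltas:
            return None
        swap, delta = max(self.component_deltas.items(), key=lambda kv: kv[1])
        return swap if delta > 0 else None


def apply_direction(components, memory):
    """Reuse the most helpful remembered swap instead of rewriting at random."""
    swap = memory.best_direction()
    updated = dict(components)
    if swap is not None:
        key, old, new = swap
        if updated.get(key) == old:
            updated[key] = new
    return updated
```

In this sketch, a swap that raised the score for one prompt (for example, a different `role` sentence) is remembered and then applied to another prompt that still carries the old value, which is one simple way to read "direction-guided" as opposed to random rewriting.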

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- Decomposing prompts into interpretable components is valuable for understanding what makes prompts effective
- Testing across multiple LLMs and domains demonstrates effort to validate generalizability
- Detailed appendices with all prompts used enhance transparency
- The working memory design that stores both component-level and prompt-level information is sensible

Weaknesses

- The core contributions are incremental improvements over existing evolutionary prompt optimization methods
- Lack of significance testing and inconsistent use of random seeds weakens confidence in reported improvements
- The framework requires extensive prompt engineering (Sub-tasks I-II, Sub-solutions I-II, multiple scenarios) that may limit adoption
- Practical Limitations:
  1. Higher computational costs than baselines
  2. Requires predefined component types that may not transfer across domai

Reviewer 02 · Rating 4 · Confidence 4

Strengths

* DelvePO achieves better performance compared with previous baselines and ablation study shows the effectiveness of each component in the method.
* This paper is well-written and the motivation is clear, meaningful.

Weaknesses

* Datasets and tasks selected are classical, relatively easy tasks for LLMs and these are not difficult for current strong LLMs anymore. I'm curious about the performance of DelvePO on more challenging and difficult tasks in LLM-era, like GSM8k, BBH, more reasoning tasks and so on.
* This paper introduces memory and in essence, memory appears as concluding insights from last-generation prompts, which is a little far-fetched. OPRO[1] gives previous good-performing prompts and worse prompts to gu

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper introduces a clear component-level prompt representation together with two explicit working memories (Component Memory and Prompt Memory), which turns otherwise highly stochastic LLM-based prompt mutation into a more controllable and reusable optimization process.

Weaknesses

1. The paper proposes a task-agnostic framework, but the initial component pool is manually collected and constructed from a wide range of related literature (line 116). This raises a question about the motivation of the method: does DelvePO truly make no strong task-specific assumptions and generalize to different tasks because of the framework design itself, or is the observed generality mainly due to the fact that a very comprehensive task component pool has been pre-collected and constructed

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Natural Language Processing Techniques