Meta-Prompt Optimization for LLM-Based Sequential Decision Making
Mingze Kong, Zhiyong Wang, Yao Shu, Zhongxiang Dai

TL;DR
This paper introduces EXPO and EXPO-ES, two algorithms that automatically optimize the meta-prompt for LLM-based agents in sequential decision-making tasks. They address the non-stationary reward observations that make this optimization hard, and they improve agent performance over fixed meta-prompts.
Contribution
The paper proposes two novel algorithms, EXPO and EXPO-ES, for automatic meta-prompt optimization in LLM-based sequential decision-making; both are designed to handle non-stationary reward observations.
Findings
EXPO and EXPO-ES outperform fixed prompts in experiments.
Meta-prompt optimization improves LLM agent performance.
Algorithms adapt to non-stationary reward environments.
Abstract
Large language models (LLMs) have recently been employed as agents to solve sequential decision-making tasks such as Bayesian optimization and multi-armed bandits (MAB). These works usually adopt an LLM for sequential action selection by providing it with a fixed, manually designed meta-prompt. However, numerous previous works have found that the prompt has a significant impact on the performance of the LLM, which calls for a method to automatically optimize the meta-prompt for LLM-based agents. Unfortunately, the non-stationarity in the reward observations during LLM-based sequential decision-making makes meta-prompt optimization highly challenging. To address this challenge, we draw inspiration from adversarial bandit algorithms, which are inherently capable of handling non-stationary reward observations. Building on this foundation, we propose our EXPonential-weight algorithm for…
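The abstract points to adversarial bandit algorithms as the starting point. As a rough illustration of that family, the following is a minimal sketch of the textbook EXP3 algorithm, not the paper's EXPO algorithm. The function names `exp3` and `drifting_reward`, the exploration rate `gamma`, and the toy reward that switches its best arm halfway through are all assumptions made for this example.

```python
import math
import random


def exp3(num_arms, reward_fn, horizon, gamma=0.1, seed=0):
    """Run textbook EXP3 for `horizon` rounds and return per-arm pull counts.

    `reward_fn(arm, t)` is a hypothetical callback that must return a
    reward in [0, 1]; it may change arbitrarily with the round index t.
    """
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    pulls = [0] * num_arms
    for t in range(horizon):
        total = sum(weights)
        # Mix the exponential-weight distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(arm, t)
        pulls[arm] += 1
        # Importance-weighted estimate: only the pulled arm is updated.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so the largest weight is 1; probabilities are unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    return pulls


def drifting_reward(arm, t):
    """Toy non-stationary reward: arm 0 is best early, arm 2 is best later."""
    best = 0 if t < 500 else 2
    return 1.0 if arm == best else 0.0


if __name__ == "__main__":
    print(exp3(num_arms=3, reward_fn=drifting_reward, horizon=1000))
```

Because the update never assumes a fixed reward distribution, the same loop applies whether the rewards are stationary, drifting, or chosen adversarially, which is the property the abstract appeals to.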
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper systematically introduces the problem of Meta-Prompt Optimization, emphasizing that unlike traditional prompt optimization methods that assume stationary reward distributions, the rewards in LLM-based agent environments are inherently non-stationary. This observation is highly insightful and effectively bridges the gap between prompt engineering and sequential decision making. 2. The paper integrates the adversarial bandit (EXP3) framework with LLM-based prompt search, proposing EXP
1. The comparisons do not include stronger baselines. There is no direct evaluation against more advanced RL-based LLM optimizers, such as PromptAgent. Some baselines, such as INSTINCT and MIPRO, are only tested on specific tasks, resulting in a lack of fair and unified experimental settings. 2. The paper lacks an analysis of the internal model dynamics, such as changes in embeddings or L2 divergences, which could provide deeper insight into the behavior of the optimization process. 3. The pape
1. The authors use a neural score estimator that takes text embeddings as input to estimate the reward of each meta-prompt. This allows the method to generalize better to unseen prompts and reduces the need for extensive exploration. 2. The authors include a comprehensive set of experiments on various tasks, demonstrating the effectiveness of the proposed method. The results show consistent improvements over existing baselines, indicating the robustness of the approach.
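The score-estimator idea described in this review can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the paper's implementation: a linear model trained by SGD stands in for the neural network, the embeddings are hand-written toy vectors, and the names `LinearScoreEstimator`, `sampling_distribution`, and the temperature `eta` are hypothetical.

```python
import math


class LinearScoreEstimator:
    """Stand-in for a neural score estimator: a linear model trained by SGD."""

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, emb):
        """Predicted score for one embedding vector."""
        return sum(wi * xi for wi, xi in zip(self.w, emb)) + self.b

    def update(self, emb, score):
        """One SGD step on the squared error between prediction and score."""
        err = self.predict(emb) - score
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, emb)]
        self.b -= self.lr * err


def sampling_distribution(estimator, embeddings, eta=1.0):
    """Exponential weights (softmax) over the predicted scores."""
    scores = [estimator.predict(e) for e in embeddings]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(eta * (s - top)) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


if __name__ == "__main__":
    # Toy embeddings and observed scores for three candidate meta-prompts.
    embs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    observed = [0.9, 0.1, 0.5]
    est = LinearScoreEstimator(dim=2)
    for _ in range(500):
        for e, s in zip(embs, observed):
            est.update(e, s)
    print(sampling_distribution(est, embs, eta=5.0))
```

Because the estimator maps embeddings to scores, it can assign a score to a candidate it has never evaluated, which is the generalization benefit the review describes.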
1. The authors argue that the adversarial bandit formulation is more suitable for the meta-prompt optimization problem than the stochastic bandit formulation used in prior work. However, I am not fully convinced by this argument. The environments in the experiments do not appear to be highly adversarial, and it is unclear how much benefit the adversarial formulation provides in practice. We acknowledge that the environments may not be stable, but they are not necessarily adversarial. More discussion
Taxonomy
Topics: Software Reliability and Analysis Research · Fault Detection and Control Systems · Cloud Computing and Resource Management
Methods: ADaptive gradient method with the OPTimal convergence rate
