ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models
Qing Zhang, Bing Xu, Xudong Zhang, Yifan Shi, Yang Li, Chen Zhang, Yik Chung Wu, Ngai Wong, Yijie Chen, Hong Dai, Xiansen Chen, Mian Zhang

TL;DR
ELPO introduces an ensemble learning framework for automatic prompt optimization in large language models, significantly improving prompt quality and robustness across various tasks compared to existing methods.
Contribution
The paper presents a novel ensemble learning-based framework for prompt optimization, combining multiple strategies and voting to enhance performance and robustness over prior single-model approaches.
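The voting step can be illustrated with a minimal sketch. It assumes each optimized prompt yields one label per input and carries a non-negative weight; the function name `weighted_vote`, the labels, and the weights are hypothetical, and the paper's actual aggregation and weight-optimization procedure is not specified in this summary.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one label per prompt in the ensemble.
    weights: one non-negative weight per prompt (hypothetical values).
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Ties go to the label that was seen first (dict insertion order).
    return max(totals, key=totals.get)


# Hypothetical outputs of three optimized prompts on a single input.
preds = ["sarcastic", "not_sarcastic", "sarcastic"]
weights = [0.5, 0.9, 0.6]
winner = weighted_vote(preds, weights)
```

Here two weaker prompts that agree (total weight 1.1) outvote a single stronger prompt (weight 0.9), which is the basic robustness argument for ensembling.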
Findings
ELPO outperforms state-of-the-art methods in prompt optimization.
ELPO improves the F1 score by 7.6 on the ArSarcasm dataset.
ELPO demonstrates robustness across different tasks.
Abstract
The remarkable performance of Large Language Models (LLMs) relies heavily on well-crafted prompts. However, manual prompt engineering is laborious, creating a core bottleneck for the practical application of LLMs. This has led to the emergence of a new research area, Automatic Prompt Optimization (APO), which has developed rapidly in recent years. Existing APO methods, such as those based on evolutionary algorithms or trial-and-error approaches, achieve efficient and accurate prompt optimization to some extent. However, these studies rely on a single model or algorithm for the generation strategy and optimization process, which limits their performance on complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- ELPO provides a comprehensive pipeline integrating generation, search, and voting in a principled manner
- The paper addresses prompt optimization from multiple angles (generation diversity, search efficiency, robustness via ensembling)
- The focus on black-box optimization makes the approach applicable to commercial LLM APIs
- Consistent improvements across diverse tasks (fact-checking, navigation, hate speech detection, sarcasm detection, reasoning)
- The global tracking of difficult cas
- Lack of statistical testing, incomplete ablations, missing computational cost analysis
- Why should this ensemble approach work? What are the theoretical guarantees? Under what conditions might it fail?
- Embedding methods, optimization procedures, and exact algorithmic integration are underspecified
- The approach requires running multiple generators, search strategies, and ensemble voting. Is the improvement worth this complexity?
- Scalability questions: How does ELPO perform with: 1. Large
1. The central idea of applying ensemble learning principles to the entire APO pipeline (generation, search, and final prediction) is a good attempt for research. It directly addresses a well-known limitation of single-algorithm optimizers.
2. The method demonstrates significant and consistent performance gains over a wide range of strong, recent baselines across six diverse tasks. The improvements on challenging datasets like LIAR and BBH-navigate are particularly impressive.
3. The proposed
1. The framework's primary weakness is its significant complexity and likely computational cost. ELPO runs three generation algorithms, two search algorithms, and an additional weight optimization step. This is almost certainly more expensive (in terms of LLM calls to the optimizer) than the single-algorithm baselines. The paper claims "efficiency" and "conserves resources", but this is only relative to a naïve evaluation of all candidates, not relative to the baselines. The lack of a cost-per
* The method is clear to understand and this paper is well-written.
* The proposed ELPO achieves remarkable scores on several datasets compared with baselines.
* The contributions are somewhat incremental, ensembling previous prompt optimization methods with careful further design in three phases: 1) prompt generation by the following methods: a) bad case reflection[1], b) EvoPrompt[2], c) OPRO; 2) search phase: a) Bayesian, b) UCB[1], c) population[2,4]; 3) voting: intuitive ensemble voting and aggregating.
* The method is not very novel, just an aggregation of previous methods.
* The experimental settings, including parameters, are not given. How ma
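As an illustration of the UCB search option named in the review above, the following is a minimal sketch of the standard UCB1 bandit rule applied to choosing which candidate prompt to evaluate next. It is not the paper's algorithm: the function names, the `true_accuracies` values, and the simulated Bernoulli rewards are assumptions standing in for scoring a prompt on a sampled example with an LLM.

```python
import math
import random


def ucb1_select(counts, means, round_index):
    """Pick the arm (candidate prompt) with the highest UCB1 index."""
    # Evaluate every candidate once before using the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        mean + math.sqrt(2.0 * math.log(round_index) / n)
        for mean, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def ucb_search(true_accuracies, budget, seed=0):
    """Spend `budget` evaluations across candidates using UCB1.

    true_accuracies: hypothetical per-prompt success probabilities,
    used here only to simulate 0/1 rewards.
    """
    rng = random.Random(seed)
    counts = [0] * len(true_accuracies)
    means = [0.0] * len(true_accuracies)
    for t in range(1, budget + 1):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < true_accuracies[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


counts, means = ucb_search([0.2, 0.5, 0.9], budget=600)
```

The point of the rule is that evaluation budget concentrates on promising candidates instead of being spread evenly, which is the efficiency argument the reviews weigh against the cost of running several generators and search strategies.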
The main strength of this paper is its significant performance improvement over existing prompt optimization methods. The proposed ELPO framework achieves consistently higher results across multiple benchmarks, clearly demonstrating its effectiveness in optimizing prompts for Large Language Models. By leveraging ensemble learning to combine diverse search and generation strategies, ELPO enhances the quality of discovered prompts and achieves state-of-the-art performance in various evaluation set
1. Unclear technical novelty and framing. The paper does not clearly articulate which components are genuinely novel versus straightforward extensions. While results suggest that incorporating diverse generators drives most of the gains, the manuscript does not clearly explain how these generators are integrated. Is there any critical design?
2. Unsupported claims about Bayesian Search + MAB. The introduction claims the method is the first to combine Bayesian optimization with MAB, but there is
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Text and Document Classification Technologies
