Prompt Optimization with Human Feedback
Xiaoqiang Lin, Zhongxiang Dai, Arun Verma, See-Kiong Ng, Patrick Jaillet, Bryan Kian Hsiang Low

TL;DR
This paper introduces APOHF, a new algorithm for optimizing prompts for large language models using only human preference feedback, inspired by dueling bandits, and demonstrates its efficiency across various tasks.
Contribution
The paper proposes APOHF, a theoretically grounded method for prompt optimization with human feedback, addressing the challenge of lacking numeric scores for prompt evaluation.
Findings
APOHF finds good prompts efficiently, using only a small number of preference feedback instances.
APOHF applies successfully to instruction tuning, text-to-image models, and response refinement.
The method outperforms baseline approaches in prompt optimization tasks.
Abstract
Large language models (LLMs) have demonstrated remarkable performances in various tasks. However, the performance of LLMs heavily depends on the input prompt, which has given rise to a number of recent works on prompt optimization. However, previous works often require the availability of a numeric score to assess the quality of every prompt. Unfortunately, when a human user interacts with a black-box LLM, attaining such a score is often infeasible and unreliable. Instead, it is usually significantly easier and more reliable to obtain preference feedback from a human user, i.e., showing the user the responses generated from a pair of prompts and asking the user which one is preferred. Therefore, in this paper, we study the problem of prompt optimization with human feedback (POHF), in which we aim to optimize the prompt for a black-box LLM using only human preference feedback. Drawing…
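As a rough illustration of the pairwise preference feedback described in the abstract: the peer reviews note that the method represents such feedback with the Bradley-Terry-Luce (BTL) model. The sketch below is not the authors' code. It assumes each prompt has a latent scalar utility (the values and the names `btl_preference_prob` and `simulate_feedback` are hypothetical) and shows how BTL turns a utility gap into the probability that a user prefers one response over the other.

```python
import math
import random


def btl_preference_prob(utility_a: float, utility_b: float) -> float:
    """Probability that A is preferred over B under the BTL model.

    BTL assumes P(A preferred over B) = sigmoid(u_A - u_B).
    """
    return 1.0 / (1.0 + math.exp(-(utility_a - utility_b)))


def simulate_feedback(utility_a: float, utility_b: float, rng: random.Random) -> int:
    """Simulate one binary preference: 1 if A is preferred, else 0."""
    return 1 if rng.random() < btl_preference_prob(utility_a, utility_b) else 0


# Hypothetical latent utilities for two candidate prompts (illustrative only).
utilities = {"prompt_a": 0.9, "prompt_b": 0.4}

prob_a_wins = btl_preference_prob(utilities["prompt_a"], utilities["prompt_b"])

# Draw a few simulated user preferences with a fixed seed for repeatability.
rng = random.Random(0)
votes = [
    simulate_feedback(utilities["prompt_a"], utilities["prompt_b"], rng)
    for _ in range(1000)
]
print(f"P(prompt_a preferred) = {prob_a_wins:.3f}")
print(f"Empirical win rate    = {sum(votes) / len(votes):.3f}")
```

Only the binary votes would be observable in practice; the latent utilities are what a preference-based optimizer has to infer from them.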
Peer Reviews
Decision: Submitted to ICLR 2025
* This paper presents a novel approach to prompt optimization by using human preference feedback alone, which is suitable for black-box LLMs.
* This paper is generally clear in its method description, though certain theoretical justifications could be expanded for a more rigorous understanding of APOHF’s design choices.
* The algorithm design balances prompt utility prediction with exploration, showing reliable performance across varied tasks.
* The method employs the Bradley-Terry-Luce (BTL) model to represent human preference feedback, which assumes consistency and transitivity in user preferences. This may oversimplify human feedback, especially in real-world applications where user preferences can be inconsistent or influenced by context. The reliance on binary feedback might also limit the granularity of information available to the model, potentially leading to suboptimal prompt choices.
* The method relies on a user-provided i
- Inspired by the concept of RLHF and Dueling Bandits, the authors propose a prompt optimization framework leveraging human feedback. The experiment results show that the proposed method is indeed effective to a certain extent with actual demonstrations.
- With the Bandit-fashioned prompt selection strategy, powered by a trained NN (MLP) model as the performance predictor, the proposed framework determines the prompt efficiently in terms of the number of required human feedback instances.
- Limited Comparison with Prompt Optimization Baselines: The authors' decision to exclude comparisons with state-of-the-art prompt optimization methods (e.g. TextGrad), citing the lack of a scoring method, may be overly restrictive. While direct human scoring might not be feasible, surrogate evaluation metrics (e.g., embedding-similarity between LLM output and ground truth) could serve as viable alternatives for these comparisons. Given the superior performance of existing prompt optimization
The paper is well written and easy to follow. The motivation is clear compared to previous methods [which may not be practical, as per the weakness part].
The main weakness is that prompt optimization from human feedback may itself not be practical. Let me explain the reasons. 1) The main effort in LLM development is to improve instruction-following abilities, i.e., to cover as many prompts as possible. Therefore, with a human in the loop, the key effort should go into zero-shot/few-shot abilities. If you expect the user to give feedback on the prompt provided by him/herself, it may negatively impact the feasible application of this LL
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · AI-based Problem Solving and Planning
