ProRefine: Inference-Time Prompt Refinement with Textual Feedback
Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Isabelle Diana May-Xin Ng, Christopher M. Homan, Wei Wei

TL;DR
ProRefine is a novel inference-time prompt refinement method that uses textual feedback from LLMs to improve multi-step reasoning performance without additional training, significantly surpassing zero-shot baselines.
Contribution
It introduces an inference-time prompt optimization technique using an agentic loop of LLMs, enabling dynamic prompt refinement without extra training or labels.
Findings
Outperforms zero-shot Chain-of-Thought baselines by 3-37 percentage points.
Enables smaller models to approach larger model performance.
Improves accuracy and cost-effectiveness of multi-step reasoning tasks.
Abstract
Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, play a substantial role in many cutting-edge commercial applications, and continue to fascinate researchers across fields for their potential to accomplish expensive, complex tasks that, until recently, only humans have been trusted to do. These workflows depend critically on the prompts used to provide the roles models play in such workflows. Poorly designed prompts that fail even slightly to guide individual agents can lead to sub-optimal performance that may snowball within a system of agents, limiting their reliability and scalability. To address this important problem of inference-time prompt optimization, we introduce ProRefine, an innovative inference-time optimization method that uses an agentic loop of LLMs to generate and apply textual feedback. ProRefine dynamically…
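As a rough illustration of the agentic loop described above, the following minimal Python sketch assumes three roles: a task model that generates output, a feedback model that critiques it in text, and an optimizer that rewrites the prompt from that critique. The function name `refine_prompt`, the callables `generate`, `critique`, and `rewrite`, and the parameters `k`, `rounds`, and `max_tokens` are all hypothetical and do not come from the paper. The growing budget of `i * k` tokens in round `i` follows the reviewers' description of the progressive continuation; stopping criteria and the verifier mentioned in the reviews are not modeled here.

```python
from typing import Callable, Tuple

# Hypothetical role signatures; these are assumptions, not the authors' API.
Generate = Callable[[str, str, int], str]  # (prompt, query, max_tokens) -> output text
Critique = Callable[[str, str], str]       # (query, partial_output) -> textual feedback
Rewrite = Callable[[str, str], str]        # (prompt, feedback) -> refined prompt


def refine_prompt(
    prompt: str,
    query: str,
    generate: Generate,
    critique: Critique,
    rewrite: Rewrite,
    k: int = 10,
    rounds: int = 3,
    max_tokens: int = 256,
) -> Tuple[str, str]:
    """Refine `prompt` for a single query at inference time, then answer with it."""
    for i in range(1, rounds + 1):
        # Round i inspects only the first i * k tokens of the task model's output.
        partial = generate(prompt, query, i * k)
        # A feedback model comments on the partial output in natural language.
        feedback = critique(query, partial)
        # An optimizer model applies the feedback to produce a new prompt.
        prompt = rewrite(prompt, feedback)
    # The final answer is generated once, using the refined prompt.
    return prompt, generate(prompt, query, max_tokens)
```

In practice each callable would wrap a call to an LLM; passing them in as plain functions keeps the sketch independent of any particular model provider and needs no training data or labels.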
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. A training-free, label-free method: ProRefine improves reasoning at inference time with textual feedback, with no requirement for fine-tuning; suitable for black-box LLMs. 2. Analysis showing how feedback quality affects performance: comparing no verifier, a verifier, and an optimal verifier shows the method’s upper bound and the centrality of verifier quality; the optimal verifier yields the best results most of the time. 3. Paper positions ProRefine for on-demand use in hybrid systems and reports
The key challenge I have with the paper is that it's not well positioned relative to the current literature. This makes the novelty of the paper unclear. I would urge the authors to compare against related SOTA methods and evaluate them against ProRefine. The proposed approach has been applied by previous works, and the unique contribution of this paper is not clear. 1. The evaluation and comparison to SOTA are very weak. The authors compare to just TextGrad; there are several other prompt optimization techni
1. ProRefine presents a novel approach to prompt optimization during inference, distinguishing itself from existing methods by utilizing LLM-generated textual feedback for dynamic refinement. This innovative use of feedback not only enhances the reasoning capabilities of LLMs but also addresses the limitations of prior techniques that often rely on extensive training data or fixed prompts. 2. The authors provide a comprehensive evaluation of ProRefine across multiple reasoning tasks, demonstrat
1. The paper does not clearly state the additional cost and latency introduced by the proposed method. Please provide quantitative results or analysis to clarify this aspect. 2. The experiments primarily evaluate relatively small and weak open-source models (mainly LLaMA). It remains unclear whether the conclusions generalize to larger models (e.g., those exceeding 30B parameters) or to more capable closed-source models. 3. I also wonder whether these tasks might already be relatively easy for
- The progressive continuation trick (i.e., using $i \cdot k$ tokens per round $i$) seems like an interesting way to catch and correct mistakes early in generation; however, the effects of this are not ablated (see weaknesses). - ProRefine appears to work decently in a few settings, e.g., +21% on Word Sorting for Llama-3.1-8B.
- ProRefine relies on larger models to provide feedback and optimize the prompt. - The authors bring up that they are concerned with resource-constrained environments where querying a capable feedback model is feasible. But is this realistic? Under what scenario would a practitioner be able to call a larger more capable LLM on-demand, but *only for feedback*? >It is designed for resource-constrained environments where deploying the largest models for every query isn’t feasible, but temporary
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
